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Abstract 



A Bayesian approach to the classification problem is proposed in which random 
partitions play a central role. It is argued that the partitioning approach has 
the capacity to take advantage of a variety of large-scale spatial structures, 
if they are present in the unknown regression function /q. An idealized one- 
dimensional problem is considered in detail. The proposed nonparametric prior 
uses random split points to partition the unit interval into a random number 
of pieces. This prior is found to provide a consistent estimate of the regression 
function in the C topology, for any 1 < p < oo, and for arbitrary measurable 
/o : [0, 1] [0, 1]. A Markov chain Monte Carlo (MCMC) implementation is 
outlined and analyzed. Simulation experiments are conducted to show that 
the proposed estimate compares favorably with a variety of conventional esti- 
mators. A striking resemblance between the posterior mean estimate and the 
bagged CART estimate is noted and discussed. For higher dimensions, a gen- 
eralized prior is introduced which employs a random Voronoi partition of the 
covariate-space. The resulting estimate displays promise on a two-dimensional 
problem, and extends with a minimum of additional computational effort to 
arbitrary metric spaces. 
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Chapter 1 
Introduction 



The binary classification problem is perhaps the simplest regression problem, 
but it continues to pose fresh challenges. In the binary classification problem, 
we are given a list of n pairs Zi = (Xj, Yi) each pair drawn independently 
from an unknown probability measure F. The X's play the role of covariate or 
"predictor" and lie in some abstract space X, while the y s are interpreted as a 
class label and are either or 1. Our goal is to estimate certain functionals of F. 
Specifically, in the binary regression problem, we are interested in estimating 
the regression function / : i— > [0, 1]. The value of / at a given point x & X 
is the conditional probability that Y = 1 given that X = x. In this way we 
model the joint distribution F by saying that to draw an {X, Y) pair from F, 
first draw a covariate X = x from the marginal distribution of X denoted by 
/i. Then "flip" an /(x) coin to determine the value of Y. 

In the classification problem, we are concerned with being able to predict 
future Y values. The standard formalization of this task is that we wish to 
choose the "decision rule" that will minimize the expected loss incurred; this 
reduces to the problem of estimating the set {x & X : f{x) > c}, for some c 
that depends upon the loss (for simplicity, ignore the possibility that c depends 
on x) . There are a great many ways to proceed on each of these problems, as 
demonstrated by the vast literature on these subjects. Some references are 
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given in lsection 2.31 

In this thesis, I propose a nonparametric Bayesian approach to the binary 
classification and regression problems. Specifically, to derive an estimator, I 
regard F itself as random. For simplicity, I regard the marginal distribution 
of X, as known. In this case, putting a prior distribution on F amounts to 
putting a prior on /. More generally, one can also put a prior on functions m 
and suppose that fi{dx) = m{x)fio{dx). Some sort of mild restriction, like this 
one that all fi share some dominating measure /iq, is useful to avoid technical 
problems in defining the conditional distribution of F given the data. 

Let TT denote a prior distribution on F, or, more precisely, on (/, m) pairs. 
Extend vr to a joint distribution on F = (/, m) and the infinite data sequence 
{Zi, Z2, . . . ) which, conditionally on F, is drawn independently and identically 
distributed {iid) from F. Formally, the posterior is the measure 7Tn{dF) := 
TT {dF\Zi = zi, . . . , Zn = Zn). In practice Markov chain Monte Carlo procedures 
can be used to generate a sample from the posterior. 

The posterior mean of / is an important summary of the posterior: its 
value minimizes the posterior risk under an loss. Let / denote the posterior 
mean: f{x) = J f {x)7in{df) . Another important summary of the posterior is 
the classification rule which minimizes posterior misclassification loss. If asked 
to predict the most likely value of the Y's corresponding to Xn+i, . . . , X„+„/ 
all at once, the decision that would minimize the posterior-expected 0-1-loss 
is simply (5j = 1 c-, , Interestingly, though, if asked sequentially instead 



of all at once, it is necessary to update / with each new data point before 
deciding. 

Taking this Bayesian approach assures us that the resulting estimators will 
have a clear subjective interpretation. In addition, if the prior n is carefully 
chosen, the resulting estimators, chosen indirectly through this Bayesian frame- 
work, may have frequentist advantages over the estimators that might other- 
wise be proposed. For example, interesting kinds of shrinkage and averaging 
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occur automatically within this framework. Subsequent chapters assess the 
frequentist performance of these Bayesian estimates by simulation experiments 
( |chapter"4l ) and theoretically ( [chapter's ). 



1.1 An Example 

To get started, let us consider the specific case in which X = [0, 1] and the 
sampling distribution of the Xi, fi, is known to be the U{0, 1) distribution. 
Further, let us consider a specific / = /o which is complicated enough that 
its estimation should not be too easy for any of the standard methods; it is 
piecewise continuous with two constant regions and a smooth transition region. 
As shown in Figure 1.1. a[ fo{x) is chosen as: 



/o(x) 



< X < i 0.6 
l<x<l 0.4 

1 < X < 1 "^^^"-^^ 



where (p^- is the density of a normal with mean and standard deviation a 
0.25. 



The two histograms in Figure 1.1. a summarize a simulated data set of 1024 



data points that was drawn from this model. The green histogram is a his- 
togram of the heads. The red histogram is a histogram of the tails; it is drawn 
upside down to facilitate comparison with the green histogram. To more easily 
interpret this display, notice that if we take the sum of the corresponding green 
and red bins at each point, we recover a histogram of the marginal distribu- 
tion, which is uniform. Furthermore, the ratio of the height of a green bin to 
the corresponding red bin represents the empirical odds of a head in that bin. 
Near x = 0.5, for example, notice the sharp transition from nearly equal green 
and red bins (on the left) to much longer green than red bins (on the right). 
The performance of this posterior mean estimator is compared with a variety 
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Figure 1.1. a: An Example: /o(a;) 



of more conventional methods in chapter 4 This example is used to illustrate 
the present approach in the rest of this introduction. 



1.2 A Nonparametric Prior on Regression Func- 
tions 

To specify a prior vr over functions / : [0, 1] t-^ [0, 1], I explain how to choose 
/, at random from it. This completely specifies a prior on the probability 
distribution F, since I consider the marginal distribution fi to be known. The 
prior on / will concentrate on locally-constant step functions. To choose a step 
function at random, first, choose K, the number of locally constant intervals, 
where: 

P(K = k) = {l- a)a''~^ for A; = 1 . . . cx) 

That is, K is Geometric with parameter 1 — a. Ultimately, the choice of 
a must be specified by the user. For the examples in this thesis, I have used 
a = I unless otherwise noted. This choice seems to perform well. For further 
discussion of how to choose a prior on K which results in provably consistent 
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Figure 1.3. a: 



estimators, see [chapter 6 



Now, conditional on K = k, choose Vi, V2, . . . , Vk-i iid from f/(0, 1). Let 
V(j) (for i = 1 . . .k — 1) he the i'th ordered value of the y s. This produces k 
intervals: 

/l = {0 < X < V(i)}, h = {V^l) <X <V^2)},..-,Ik = < x < 1} 

If /c = 1, simply take /i = [0, 1]. Finally, conditional on K = k, choose k values 
Si (for i = 1 . . .k) iid from U{0, 1). This generates the random function /: 

k 

fix) = 5^543; e /, 

1=1 



1.3 Sample Results 

Conditioning this prior on the data described in the earlier example section 
results in a posterior distribution on regression functions /. Applying the 
techniques in chapter 3 to sample from the posterior results in a long list of 
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sampled functions. Although the prior specified a Geometric{\) prior distribu- 
tion on the number of locally-constant pieces K in each function, when drawn 
from the posterior, functions typically have at least 5 pieces. Taking the av- 
erage of these functions (at every x) gives an estimate of the posterior mean; 
this is illustrated in Figure 1.3. a Further discussion of results like this can be 
found in [chapter 4 



1.4 The Partitioning Approach 

Step functions, such as the functions / that the prior tt concentrates on, have 
the advantage of great mathematical simplicity; but, if specified in sufficient 
detail, they can approximate general functions. For a multi-dimensional regres- 
sion function, it is natural to generalize this idea by considering some partition 
of the covariate space X into a number of pieces and constructing a function 
that takes a different value on each piece. A procedure that uses a wide variety 
of geometric shapes to partition the space might be able to find a partition 
with the right structure to approximate the unknown regression function. Ide- 
ally, the partition would be no more complex than necessary to achieve a good 
approximation. If the unknown regression function has certain global features 
that the chosen partition can be adapted to, this idea becomes very powerful. 
Instead of merely "borrowing strength" locally, like ordinary smoothing estima- 
tors do, a partition-based estimator borrows strength across the whole range 
of a partition element. As a simple example, if the true regression function 
does not depend on one of the covariates, the partition elements do not need 
to break up space along this dimension at all; this results in larger partition 
elements and more efficient estimation of the success probability on each of the 
pieces. Similarly, if the regression function is almost fiat in some large chunk 
of space, then this whole region can become a single element. If the level-sets 
of the regression function have smooth boundaries, perhaps the partition ele- 
ments can be chosen to follow these contours. Finally, since the best partition 
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for the unknown function is unknown, it makes sense to average together the 
approximations found by a variety of partitions that make the data hkely. This 
is exactly what the posterior mean estimate will do automatically. Chapter [7| 
shows one way in which this idea can be applied using Voronoi partitions. For 
this partition, a prior distributes seed-points in the covariate space; each seed 
corresponds to a partition element, the one that consists of all points closer to 
that seed than any other seed. By placing the seeds appropriately, the partition 
can be fine or coarse as needed. The boundary between partition elements is 
itself a hyperplane whose position and orientation can be controlled through 
the placement of the seeds. 

Other authors have recently considered similar priors with good success. 
Their work is discussed in lsection 2.21 To the best of my knowledge, this thesis 
presents the first theoretical examination of consistency issues for priors of this 
sort (c.f. [chapter 3 with discussion in [chapter 6 ) . Additionally, I present a 
detailed assessment of the empirical performance of my methods on certain 
novel simulation experiments. The comparison with bagged CART regression 
trees in 



chapter 4 is especially interesting. 



1.5 Outline 

Chapter [21 gives a literature review. Chapter [S| shows how to (approximately) 
compute samples from the posterior and the posterior mean. Chapter [^ trys 
out the method on examples and carefully compares its performance with that 
of a variety of existing methods. Chapter [3| gives sufficient conditions on the 
prior under which it provides universally consistent estimates of /. Chapter [7[ 
describes a different prior which extends these ideas to general metric spaces 
by employing random Voronoi partitions. Certain modifications are explained 
that make the proposal more practical and its performance on an example is 
shown. Finally, the afterword gives a philosophical argument that advocates 
the use of Kolmogorov- complexity in future statistical thinking. 



Chapter 2 



Literature 



This chapter reviews and discusses the hterature on three subjects. The first 
section reviews some theoretical results concerning the frequentist performance 
of Bayesian procedures. The second section gives a survey of some of the work 
done by authors on related Bayesian efforts. The final section briefly surveys 
some salient examples of alternative approaches to the classification problem. 



2.1 Theoretical Results 

The frequentist performance of Bayesian methods is of fundamental interest 
in statistics. Given a large sample from a smooth, finite-dimensional statis- 
tical model, the situation is quite well understood. The Bernstein-von Mises 
theorem 



|49|,|33i shows that the Bayes estimate and the maximum likelihood 
estimate will be close. Furthermore, the posterior distribution of the param- 
eter vector around the posterior mean is close to the distribution of the max- 
imum likelihood estimate around the truth: both are asymptotically normal 
with mean and the same covariance matrix. Unfortunately, though, in more 
general circumstances, such as those needed for this work, the situation can 
be much more complex. In particular, the basic model is based on an infinite 
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hierarchy of finite dimensional models. Moreover, even for a given finite dimen- 
sional submodel, the dependency of the likelihood function on the parameter 
is not smooth; the functions are allowed to take jumps. Consequently, a more 
general theory is needed. 

This section reviews some of the literature on this subject with a focus on 
results that address the question of consistency: i.e. as the number of data 
points tends to infinity, will the Bayesian estimate converge to the true value 
(in some suitable sense) almost surely (resp. in probability)? The literature 
contains a number of useful and quite fiexible positive results, but also a va- 
riety of interesting negative examples showing that the regularity conditions 
under which the theorems hold are not to be taken lightly. A good intro- 
duction to these issues is by Diaconis and Freedman [23[. Throughout this 
section, the reader may envision a family Pg{dx), a prior 7r{d6), and posterior 
7r(0|xi, . . . ,Xn), where the Xi are drawn iid from Pg^^dx). Consistency means 
that the posterior concentrates at for large samples. 

Doob established a fundamental result under minimal regularity as- 
sumptions using a martingale convergence argument. Roughly speaking, the 
result states that if consistent estimators exist at all, then a Bayes procedure 
will provide an almost surely consistent estimate of the true parameter 6 under 
sampling from the 6 distribution for any 6 in some set B which has prior prob- 
ability of 1. Notice, though, that this does not specify if consistency will obtain 
at any particular point of interest Oq, unless 60 happens to be a point-mass of 
the prior, or unless it possible to determine B by some more detailed line of 
argumentation. 

rVeedman fl — ed the ca.e in which the observations a.e discrete. 

If the set of possible observations is finite, the posterior is consistent exactly 
for parameter values in the topological support of the prior. The countably 
infinite case is more complex. He constructs a class of examples showing that it 
is possible to construct a prior which assigns positive mass to every (weak star) 
neighborhood of the true parameter value, but for which the posterior converges 



CHAPTER 2. LITERATURE 



10 



to a point mass at some other (chosen) parameter value. Furthermore, he 
finds a prior which assigns positive prior mass to every (weak star) open set of 
parameters, but for which the posterior is consistent only at a set of parameters 
of the first category. The reader should note that this prior did not assign 
mass to all entropy-neighborhoods. This sort of subtle distinction can make 
all of the difference and explains the necessity of some such assumption in the 
following consistency theorems. He introduces the "tail-free" priors for the the 
countably- infinite case and demonstrates that these are always consistent. 
Lorraine Schwartz jo^] explored the question of consistency in a very gen 



61 



eral setting. She extended Doob's result to a broad class of loss functions 
lemma 4.2]. She also found sufficient conditions for the posterior to be con- 
sistent under iid sampling. These conditions, she says, are "of an essentially 
weaker nature" than the conditions established for the consistency of maxi- 
mum likelihood estimators. Nevertheless, she constructs an example where the 
maximum likelihood estimate is consistent and the estimates based on certain 



priors are not. The example ([6l|, example 3]) involves a simple parametric 
family of densities which satisfies Wald's conditions, thereby guaranteeing that 
the maximum likelihood estimate will be consistent, but for which the posterior 
can be inconsistent. The consistency of the posterior in this case, is found to 
depend critically on the amount of mass that the prior ascribes to small neigh- 
borhoods of the true parameter value; if this mass shrinks too quickly, the 
prior "ignores" the data. One clever aspect of her construction is the way the 
densities are parametrized. Parameter values close to the target value Oq corre- 
spond to densities that are close to the 6'o-density in an sense, but which are 
farther and farther away in Kullback-Leibler discrepancy. In fact, there is only 
one point in parameter space (the true parameter) that has Kullback-Leibler 
discrepancy from the truth smaller than e, for e sufficiently small. 

Schwartz then shows that the posterior will be consistent under iid sampling 
under two basic conditions. First, the prior should have positive mass on 
Kullback-Leibler neighborhoods of the true parameter (defined in Isection 5!T] 
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of this thesis) , and second, the model class should not be too rich; specifically, 
she requires that uniformly consistent tests of the hypothesis that 9 = 9q 
against the alternative that 9 lies outside a given (open) neighborhood of 9q 
exist . 

It is not always obvious how to verify the later property directly. Mod- 
ern authors have employed entropy-type bounds to guarantee their existence. 
Ghoshal, Ghosh, and van der Vaart j40| state a theorem ( 40|, theorem 7.3]) 
which proves that the posterior converges at a certain rate if certain uniform 
tests exist (and the prior mass is suitably distributed) and go on to find a 
variety of entropy-type conditions that suffice to be able to construct the nec- 
essary tests. Shen and Wasserman 6^ show related results, requiring slightly 
different conditions on how mass needs to be allocated-they do not a make a 
connection with testing. Barron, Schervish, and Wasserman 0] find sufficient 
conditions for the posterior to be consistent; their results are reviewed and 



then used in chapter 5 



It should be noted that these various conditions for consistency are not 
necessary, but merely sufficient. Nevertheless, it is important to treat this 
subject with care because of the variety of examples for which consistency 
fails. 

Barron, Schervish, and Wasserman also give an interesting example where 
consistency fails. In this example, they show that the prior puts too much mass 
on a very rich class of models that will be able to match any spurious structure 
that the data might have by chance, overwhelming the true parameter. Fur- 
thermore, lest the reader get the wrong idea, inconsistency does not only occur 
in artificial examples. A series of "natural" yet still inconsistent estimators for 
the symmetric location problem are discussed by Diaconis and Freedman [2^. 
In addition, the binary regression example explained in the next section has a 
natural motivation based on conditional exchangability. 
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2.2 Related Bayesian Work 

The following subsections contain a review of work by other authors that is 
closely related to this thesis. It is followed my a brief synopsis of the contribu- 
tions that this thesis makes to the literature. 



2.2.1 A Dyadic Prior for Binary Regression 

The most relevant examples for the work of this thesis are the nonparametric 
binary regression examples of Diaconis and Freedman 0, 2^. They use a 
different prior; call it vtdf; a hierarchical, dyadic prior on /. To describe tibf, 
let Ak be the set of intervals which result from partitioning the unit interval 
into 2'' equal pieces. Let JF^ be the subset of functions which are constant on all 
intervals a E Ak. Finally, fix a prior distribution k, on the non- negative integers. 
Assume, for simplicity, that K,{k) > for all k. To draw / from it-dp, draw K 
from K, and then, conditional on hierarchy-level K = k, draw / uniformly at 
random from Tk- In effect then, at level k one draws 2'^ independent f/(0, 1) 
random variables to describe the success probability on each of the 2^ pieces. 

They show that for any k, and any /q (except possibly for /q = ^), the 
posterior estimates are consistent (in the sense that any neighborhood of /o 
has posterior probability tending to 1 a.s.). Remarkably, however, for /g = |, 
the posterior can be an inconsistent estimate if the tail of k is sufficiently heavy. 
Specifically, let = — \og{K{K > k))/k. Then if limsup > \crit = 2^3: ^ 
0.841, the posterior is inconsistent at /o = \. On the other hand, if limsup \k < 
\crit, the posterior is consistent for any /q. To put this in perspective, for k(A;) = 
(1— (a shifted Geometrical — P) prior), limsup A^ = — log(/3). The critical 
value for P is exp(— Acrit) ~ 0.431; for larger P (longer tails) inconsistency will 
occur (but only for /q = |). 

This result is substantially stronger than the result I have obtained for my 
prior TT. In particular, applying the same (general) method of proof that I 
employed to prove consistency for tt to ttdf yields only the result that ttdf is 
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consistent if the tails of k drop off at least as fast as those of a Poisson. (Recall, 
that at level k, it only divides [0, 1] into k intervals, but ttdf divides it into 2'^.) 
Their method of proof is direct: using Bernstein's inequality, Poissonization, 
and special features of the prior. My method of proof is indirect; it uses general 
results that employ entropy-type bounds. 

There are striking similarities between vr and vtdf- In fact, vr is equivalent 
to a suitably randomized hbf- To achieve this, it is not enough to simply 
randomize the dyadic split points. Instead, recall that ttdf has an alternative 
interpretation in terms of binary sequences. At hierarchy-level k, ttdf is uni- 
form over jFfc. This corresponds to independently assigning uniform success 
probabilities to each binary sequence of length k. Here is an alternative way 
to draw / from tt. Draw g from ttdf and interpret g as function on binary 
sequences of length k {k depends on g). Let (z = 1, . . . , /c) be iid f/(0, 1) 
random variables. To any point u G [0, 1] associate the binary random variables 
r]i{u) = l{u < Vi) {i = l,...,k). Define / via f{u) = g{{r]i{u), . . . ,r]k{u))). 
Note that only a small fraction of possible binary sequences are realized in this 
manner (at level k (which ranges from to oo under ttdf); ^ + 1 sequences out 
of the full set of 2^ possible sequences are achieved). 



2.2.2 Bayesian CART 

Two other closely related priors can be described as Bayesian versions of the 
CART algorithm. This was pursued by Chipman, George, and McCuUoch, 
whose prior closely parallels the choices made in the original CART algo- 
rithm Here is a description of their prior when the covariate 
space is TZ^. Their prior starts with a root node (which represents the whole 
space); this node is then recursively partitioned in a random way. For each 
node, randomly choose whether to split it or not, then choose a coordinate 
to split on, then choosing a split point (i.e. the cutoff value) randomly from 
among the midpoints between the ordered values of this coordinate; finally each 
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leaf node is given an independent regression value. The details of how these 
decisions are made differ in their particulars from the ones that I described in 
the introduction. In early work, these authors observed that using MCMC to 
sample from the posterior of this prior provides a rudimentary (global) search 
procedure, which has certain (apparent) advantages over the greedy search pro- 
cedure commonly implemented in CART-type algorithms. In later work, they 
examined and computed the (approximate) posterior mean (working primar- 
ily on the least-squares white-noise regression problem) and found that it had 
good performance. They also considered extended priors that modeled the 
regression values as additively (not independently) generated 0]. 

Denison, Mai ick, and Smith, independently considered another version of 
Bayesian CART 18|, llll ll9| . For one- dimensional problems they propose using 
random splines (the prior I use is essentially a special case of this prior). They 
consider some of the regression examples that are standard in the wavelet lit- 
erature and show that their spline methods perform equally well. Additionally, 
they propose a Bayesian version of Friedman's MARS which puts a prior on 
functions that are constructed by adding together random spline-type ridge 
functions. Denison, Adams, Holmes, and Hand discuss the usefulness of ran- 
dom partitions in this paper ^ . 

Very recently, Denison, Holmes, Mallick, and Smith have written a book 
which surveys some related Bayesian regression schemes, including a Bayesian 
method for (multiple class) classification using Voronoi partitions that is very 
closely related (albeit independent of) the work that I present in chapter 7| The 
book also discusses Bayesian wavelet methods, and an interesting Bayesian 
nearest-neighbor prior. As a default prior, they recommend assuming that 
every model in a "single dimension" is equally likely, and each dimension is 
equally probable, a priori. This "flat prior," they claim, should serve per- 
fectly well because of the, "natural tendency" for the marginalized likelihood 
to penalize complex models: 
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On the face of it, we might be concerned that the flexible model- 
ing strategy we advocate might be prone to overfitting the data by 
adding too many basis functions. Indeed, many papers found in the 
literature advocate explicit priors on the model space that penalize 
the dimension of the model. However, throughout this book we ar- 
gue that such a measure is unnecessary. The Bayesian framework 
contains a natural penalty against over [sic] complex models, some- 
times called Occam's razor, which essentially states that a simpler 
theory is to be favoured over a more complex one, all other things 
being equal. 

There is no consideration given to the possibility that this might give rise to 
inconsistent estimates (e.g. as in the Diaconis and Freedman non-parametric 
regression example explained earlier); indeed there are few theoretical con- 
siderations at all in the book. Their explanation of why the Markov chain 
techniques that they develop should actually give meaningful samples from the 



posterior appeals to Green's reversible jump j41|. The explanation given is 
vague and ultimately they decide to avoid the issue and appeal to the fact 
that their chains are discrete. The chains in [chapter "3 of this thesis involve a 



continuous state space and do not simply avoid this issue by discretizing the 
continuous modeling space as these authors seem to do. 

Overall, the book emphasizes main ideas, algorithms, and results. It seems 
that for every existing regression technique, they want to demonstrate that 
they can make a "Bayesian" version of it too. The book does not emphasize 
subjectivism, but rather adopts an "A^open" perspective to Bayesian modeling: 
"we never believe that the true model lies in the set of possible models." The 
book does do a good job of supplying default priors for a wide variety of possible 
parametric models. Similarly, Denison's thesis emphasizes the wide variety 
of problems to which Bayesian partitioning methods of this sort can be applied. 
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2.2.3 Poisson Rate estimates using Random Partitions 

Green j4l|, and Scargle develop priors on piecewise constant functions 
on the real line and M'^ using Voronoi cells. Their priors are quite similar to 
the ones developed in this thesis, but are intended to address the problem of 
estimating the rate function of a Poisson process. In principle, one could apply 
their techniques to the problem of binary regression by generating an estimate 
of the rate function of the "heads" process and the "tails" process separately 
and then combining the results. I do not think that this has been tried and it 
seems substantially less "natural." 

Green applies his method to a coal mining dataset and a synthetic two- 
dimensional example. For these example. Green assumes that an individual 
cell's rate-parameter is drawn independently from a r(a, (3) prior. For the one- 
dimensional case he advocates a prior which "probabilistically" spaces out the 
change-point locations; specifically, if there are j change-points, the ordered 
locations of the change-points are distributed like the even order statistics of 
2j + 1 independent uniform values. He argues that this is good because it 
prevents small change-point intervals from entering into the posterior. For 
the two-dimensional example, the generating points of the Voronoi partition 
are drawn independently and uniformly. Green's methods are given, in part, 
as examples of his "reversible jump" MCMC technique. This technique has 
become an accepted part of MCMC practice, but is not accepted by all experts 
in MCMC theory because it does not lay down in a straightforward "theorem- 
proof" manner the necessary conditions and consequent conclusions. For this 
reason, detailed verifications for the chains used in this thesis are given in 
chapter 3 

Scargle's work is applied to astronomical data; he concentrates on the prob- 
lem of finding the mode of the posterior, rather than the posterior mean. For- 
tunately, he and coworkers have developed a way of computing this mode in 
the one-dimensional case exactly and efficiently using a dynamic programming 
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approach Instead of giving each cell an independent value, Scargle gives 
each cell a (logical) "color" and then associates each unique color with an inde- 
pendent rate-parameter. This allows him to use a fine partition and then group 
"chunks" back together into more complicated shapes. The way he forms this 
partition is also different; in particular his "prior" is data dependent, but not 
quite in the way of the "prior" that I consider in [chapter 7| Rather, the data 
is used once and for all to generate the fine Voronoi partition of space that 
results from using all of the data points as generators. These cells are then 
"clumped" (i.e. given a logical color) and the clumps are given an independent 
rate parameter. 



2.2.4 Bayesian "Image" Analysis 

M0ller and Skare [s^ apply their work to reservoir modeling and connect their 
work to efforts in Bayesian image analysis (including Markov random fields). 
They use a random Voronoi partition of the data and assign each partition 
element a random color (in a way that depends only the colors of neighbor- 
ing cells). They supply several further references to work in Bayesian image 
analysis which use Voronoi cells. From their perspective, to calculate their 
posterior they are simulating from a special "marked point" process. The gen- 
erators of the Voronoi cells are regarded as point set that has been drawn from 
a homogeneous Poisson process of rate (3 on the unit cube. In the simplest 
case, the marks or "colors" of these points are just integers from 1 up to M 
that have been drawn independently. More generally, according to their prior, 
the conditional distribution of the coloring of cells given is an Ising or Potts 
model. The graphical structure of this model is determined by consideration 
of which Voronoi cells are neighbors, and the 6 parameter is chosen to refiect 
their prior belief that neighboring cells tend to be of the same color. They 
consider two problems. The first is a simulation experiment in which a "true" 
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binary image is degraded with Gaussian noise. The second is a three dimen- 
sional reservoir problem based on real data. It is supposed that a certain three 
dimensional cube (the reservoir) consists of 4 different types of rock. The rock 
types are observed along seven vertical lines, representing the observations of 
rock that were made as seven wells were dug into the reservoir. In both prob- 
lems, the true object to be recovered is itself a certain "coloring" of space (i.e. 
rather than a continuous regression function). For the MCMC computation of 
their posterior they apply the birth-death type Metropolis-Hastings algorithm 
for point processes, as studied by Geyer and M0ller j38| and claim that their 
target distribution satisfies a local stability condition (see Geyer js^], Kendal 
and M0ller 0, and M0ller 0) so that the MCMC is actually geometrically 
ergodic. 

2.2.5 Polya Trees 



Finally, Polya trees |48[ and especially randomized Polya trees j55| deserve to 
be mentioned. The basic Polya tree puts a prior on distribution functions on 
the unit interval. The unit interval is divided recursively in a dyadic binary 
way and mass is allocated to each piece of the partition in a stagewise manner 
by first determining how much of the mass that is available will be on the left 
versus the right half and then continuing with such determinations layer by 
layer. Each of these assignments is ultimately determined by independent Beta 
random variable, whose parameters depend upon its location in the "tree." If 
a suitable choice of these parameters is made the result prior on distribution 
functions concentrates on distributions that are absolutely continuous with 
respect to Lebesgue measure. The essential advantage of Polya trees is that the 
posterior of Polya tree prior is easily and analytically computable, being itself 
another Polya tree. For randomized Polya trees, the partitioning scheme is 
independently "jittered" at random in a particular way jSSf- A Hybrid MCMC 
can be employed to sample from the randomized Polya tree posterior which 
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uses a Gibbs step to take advantage of the ease with which the (internal) Polya 
tree posterior can be computed. Both methods can be extended (essentially 
by taking "direct products" ) to put a prior on distributions on the unit cube. 



2.2.6 The Contributions of this Thesis 

Reviewing the depth and breath of the literature reviewed above may leave 
the reader in doubt about the contributions of this thesis. After all the one- 
dimensional prior that I consider is essentially a special case of the univariate 
spline model and the idea of using Voronoi partitions is certainly not new, 
although effective Bayesian methods using them only started springing up fairly 
recently. 

Still there is room for careful analysis. This thesis establishes that the 
posterior is consistent under suitable conditions on the prior and for any mea- 



surable regression function (see chapter 5 for details): an issue which none of 



the "Bayesian CART" or "Voronoi Partition" authors address at all. This the- 
sis also gives an explicit Markov chain Monte Carlo algorithm fsee lsection 3.4|) . 
Broadly speaking it is a fairly standard birth-death Markov chain as consid- 
ered by Geyer and MoUer j38|, but the technicalities of the analysis seem to 
be somewhat different. This thesis proceeds to show in detail that it satisfies 
detailed balance by direct self-contained argumentation; further, the chain is 
shown to have an ergodicity property fsee lsection 3.10|) . These considerations 
are often glossed over in modern writing. 

On the more practical side, chapter 4| scrutinizes the behavior of the pos- 



terior mean estimate under a variety of carefully designed simulation exper- 
iments. These experiments both serve to analyze the posterior mean and to 
give insight into the relationship between Bayesian methods and their clas- 
sical counterparts. See for example the discussion of CART and bagging in 
Isubsection 4.1.31 
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2.3 Other Approaches 

The literature on classification and regression methods is huge; the interested 
reader is urged to consult good modern books on the subject like The Elements 



of Statistical Learning, by Hastie, Tibshirani, and Friedman |43[. The following 
paragraphs outline some of the methods that have had the most impact upon 
the author. 

In the statistics literature, classical approaches to the classification and bi- 
nary regression problem include logistic regression, Fisher's discriminant anal- 
ysis, and projection pursuit methods. Logistic regression specifies that the 
success probability regression function is such that its log-odds follows a linear 
model with a user specified basis (e.g. by using polynomial or spline functions 
of the covariate-data) and estimates the parameters by maximum likelihood. 
Model selection is commonly performed using classical methods to select a sub- 
set of the covariate variables. Fisher's discriminant analysis finds a hyperplane 
which "optimally" separates the two classes using a within versus between 
variance criterion. Projection pursuit seeks an interesting linear (or sometimes 
nonlinear) projection of the covariate-data onto a lower dimensional subspace 
(e.g. M). Various criteria have been proposed to define "interesting," some of 
which are suitable for the classification problem. Each of these methods has 
undergone a variety of generalizations and tweaks to address a wider range of 
problems over the years. 

The first general method to solve the classification problem automatically 
was the A;- nearest neighbor approach fc-Nearest neighbor estimates are 

known to be universally consistent if k = k{n) | oo slowly enough 2]J. Their 
convergence, however, especially in high dimensional problems, can be slow in 
practice js^. 

Local regression methods are a clever extension of this approach. To predict 
at a given point, instead of averaging the values given at the neighbors, they 
fit a low-order linear model to a locally- weighted version of the data set jl^ . 
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Trees 141 and neural nets differ in that they search through a globally- 
parametrized class of functions. In all of these methods, cross-validation is 
often employed to estimate frequentist "out-of-sample" performance and se- 
lect a regularization parameter which governs the trade-off between bias and 
variance j43|. 

Wavelet methods are in some ways a compromise between the local and the 
global approaches mentioned above. They fit an explicit global linear model 
to the data, but the basis elements in this model are carefully constructed to 
maintain "localization" (in space and frequency domains). They boast pow- 
erful asymptotic compression and approximation properties, computationally 
efficient transforms, and can employ special thresholding methods which "op- 
timally" choose which coefficients in the model are kept [2^. However, their 
practical use seems to remain concentrated on the case of regularly-spaced 



regression data. Some recent papers address this shortcoming 

Support vector machines (SVMs) 6^ employ a "kernel-trick" to reduce 
consideration of a certain globally-parametrized model class to consideration 
of an equivalent linear model class in an abstract Hilbert space. The estimated 
decision rule corresponds to the solution of a convex optimization problem. 
This objective function still involves an unknown regularization parameter. In 
practice, this parameter is often chosen by cross-validation, but, in principle, it 
can be chosen through consideration of the structural risk minimization (SRM) 
paradigm. The advantage of using the SRM paradigm is that one obtains 
provably valid confidence statements about the error rate that will obtain on 
future data. Moreover, these confidence bounds improve at an exponential rate 
in the number of data points. With realistic sample sizes, however, the bounds 
are often too crude to be of practical use. There are hidden connections between 
SVMs and (1) Bayesian methods employing Gaussian-process priors on the 
regression function (including the generalized spline methods of Wahba IggHgtIi) 
(2) projection pursuit regression 
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Bagging [2| and boosting j2J] are meta-algorithms that "boost" the per- 
formance other classification algorithms (especially trees) by taking carefully 
chosen weighted averages of the results of the boosted (respectively, bagged) al- 
gorithm. There are close connections between boosting and the Lasso penalty, 
which itself is closely related to the least angle regression method (LARS) jSOf. 



Chapter 3 



Computing the Posterior 



This chapter describes a Markov chain Monte Carlo (MCMC) algorithm that 
can be used to (approximately) draw samples from the posterior of the one- 
dimensional random step function prior tt. This is essential, because for a 
complex prior like vr, analytical evaluation of properties of the prior is in- 
tractable. All computation about the posterior, therefore, is made through 
(approximately) generating a large sample from it. Before describing these 
algorithms, it is natural to review the prior and then derive a more refined 
mathematical expression for the posterior. This exercise has the side effect of 
suggesting a more efficient sampling scheme. An informal sketch of the MCMC 
algorithm is then given in lsection 3.41 Additionallv. Isection 3.5l explains an effi- 
cient way to use these samples to calculate the posterior mean. The interested 
reader is invited to download an implementation of these algorithms and others 
from the author's web page. 

To define the algorithm more mathematically, a brief review of Markov 
chains and the Metropolis-Hastings algorithm for general state spaces is given. 
Section 13.71 gives a simple example of such a chain. Section 13.81 gives a math- 
ematical treatment of the more complicated MCMC algorithm that was only 
sketched previously. It also verifies that the Markov chain satisfies detailed bal- 
ance with respect to the posterior. This is done mainly to provide a thorough 
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and relatively self-contained theoretical analysis of the Markov chain. 

Many of these calculations are essentially of the same type as those consid- 
ered by Green Indeed, the birth-death move that I employ is essentially 
equivalent to the ones that he calls a "reversible jump." The verification of 
detailed balance is simple using his results. One need only observe that the 
transformations involved in these jumps essentially only permute coordinates; 
consequently the absolute value of the determinant of the Jacobian of this 
transformation is identically 1. Section reviews two theoretical results that 
give sufficient conditions for a Markov chain to produce the intended ergodic 
sequence. Section 13.101 shows that these results are applicable so that, for 
example, the posterior mean that is computed will indeed approximate the 
intended curve. 

3.1 Specification of the Prior in One Dimen- 
sion 

A review of notation and the prior is in order. Section 11.21 gives an informal 
description of the prior tt and Isection gives a more formal specification of 
the prior vr and its parametrized version it'. The parameter space = W^^^Qk 
where 0^ parameterizes the class of functions / : [0, 1] — > [0, 1] that have k 
locally- const ant regions. Any such function is (essentially) determined by two 
vectors s and v. Vector s lists the k values that the function takes on each 
region (as enumerated from left to right). Vector v lists (in no particular order) 
the k — 1 locations at which the function jumps. This explains the definition: 

0fc := {{k, V, s) : V G [0, s G [0, 1]^'} (3.1) 

01 is a special case because there are no splits in a function that is everywhere 
constant. Define 0i = {(1, 0,s) : s G [0, 1]^}, where the symbol ^ represents 
an empty list (which is not considered equivalent to an empty set). This is 
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consistent with the former definition of 0^ if one allows the notation: [0, 1]" = 

{ }• 

To specify the prior vr, first select a probability distribution k on N. k 
represents the a priori distribution of the number of regions in the unknown 
function. In the introduction, the choice k = Geometric{^) was suggested. 
Assume that K,{k) > for all A; G N, where (technically) K{k) is shorthand 
for K{{k}). To pick a value 6 E Q from the (parametrized) prior vr', first 
draw K ~ k. Then, ii K = k, draw S = s uniformly from [0, l]'^ and draw 
V = V uniformly from [0, Form 6 = (A;,s, v) G 6^. This completes the 
description of vr'. 

For a given point 6 G Qk it is convenient associate a number of objects. 
Let k{6), v(6'), and s{6) stand for the k, v, and s parts of 6 respectively. Let 
V(j) denote the z'th ordered value of v. Additionally, for 6 = (A;,v,s) G 6^ 
associate the function fg whose splits points and values are determined by v 
and s. Specifically, for k > 1, let h = [0,V(i)),J2 = [v(2), V(3) ),..., Jfe_i = 
[v(fc-2), V(fc„i)),4 = [v(fc_i),l]. If /c = 1, just let Ji = [0,1]. Take fg{x) = 

I have chosen to work with uniform distributions on the splits and function 
values. Both of these choices could be varied, e.g. by using one- dimensional 
Dirichlet distributions for the split locations and using Beta distributions for 
the function values where, perhaps, the parameters of the Beta distribution 
are allowed to depend on spatial position. Diaconis and Freedman [2^ adopt 
this level of generality (for the success probability prior) , but I have not found 
it useful. What would be useful (but is avoided for simplicity of presentation) 
is to extend from binary classification to the multi-class case. This can be 
done by generalizing the Beta{l, 1) prior into a discrete Dirichlet prior on the 
class probabilities. Allowing dependencies among the parameters would more 
substantially complicate the analysis. 
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3.2 Representing the Posterior 

The posterior is just the result of conditioning the prior on the data. The 
data is the hst V := {xi,yi), . . . , {xn,yn) where for i from 1 to n, Xi G [0, 1] 
represents where the i'th point occurred and Ui G {0, 1} represents whether the 
Bernoulli{f (xi)) "coin" came up heads or tails. Denote the posterior distribu- 
tion on 6 given the data by vr^: 

7r'^{de) := ir'ide \ V) (3.2) 

Provided that the denominator is non-zero and finite (which it will be) the 
posterior vr^ has a density (j){9) = (f){6; T>) with respect to the prior tt', so that 
TT'^{de) = (j){e)n'{de), where: 

(3.3) 



/g L{e'y(de') 

The likelihood function L{6) = L{6; T>) is defined by: 



m:=\[fe{xr{l-fe{x,)f-^^ (3.4) 



i=l 



Recall that for 6 = {k, v, s), fg{x) = Yl'j=i ^j^x G ' "^^ere the intervals Ij 
implicitly depend 9, but only through v. To evaluate fg{xi), then, is simply a 
matter of determining for which value j from 1 to k, Xi G Ij and then retrieving 
that Sj. Call the value of j for which x G Ij, J(x; 6). Then fe{xi) = s{6)j(^xi:e)- 
Consequently, L is simply a certain product of terms of the form s{6)j or 
1 — s{6)j. To collect these together, define: 

Nj = N]{e; V) = (# of data points in Ij labeled 1) (3.5) 
= N^{e; V) = (# of data points in Ij labeled 0) (3.6) 
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So that for 9 G Qk, (suppressing the dependence on 6 and V from the right 
hand side) L{9) can be expanded as: 

k 

m = ll^f{^-^^f^ (3.7) 

Notice, then, that for fixed model number k and change-point locations v, 
L{6) only depends on the data through the (conditionally) sufficient statistics 
A'^^ and Nj for j = 1, . . . , k. Moreover, for 9 G Ofc, L is simply the product of 
k different binomial likelihood functions. This is intuitively obvious: If I have 
already decided exactly where the change-points are, the only remaining pa- 
rameters are the success probabilities s. Furthermore, according to the model, 
if I get a data-point (x, y), the x necessarily lands in some interval /-,■ and, then, 
the y is just the result of flipping an (independent) coin. 

Also notice that under the prior the S/s are independent f/(0, 1) random 
variables. From the above discussion, it is apparent that if we condition on 
K = k and V = v, the data simply tell us how many times each of the k 
"coins" with success probabilities Si through came up "heads" and "tails" 
as they were (collectively) flipped n times. Consequently, under the posterior, 
conditioned on K = k, \ = \, and on the values of A'^ and Nj, each Sj is an 
independent Beta{Nj + 1, A'° + 1) random variable. We recover this fact by 
direct calculation shortly. 

These observations can be used to motivate an MCMC scheme that is sub- 
stantially more efficient than the naive one that randomly changes k, v, and 
s in the standard Metropolis-Hastings fashion. Namely, we will only have to 
use Markov Chain Monte Carlo steps in order to sample (A;, v) pairs from their 
marginal under the posterior. If desired, we can then create a complete sample, 
including a realization of s by sampling from the k independent Beta random 
variables whose parameters were explained above. To compute the posterior 
mean, s need not be simulated at all. The mean of the Beta distribution can be 
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computed analytically. This avoids substantial Monte Carlo error. For details, 
see sections 13.41 and 13.51 

Continuing with the calculations, write 6 = {k, v, s), and let C = {k} x Ax 
B = {e e Qk ■■ ^ A,s e B} hi A, B measurable subsets of [0, [0, 1]*^ 
respectively. Compute: 



KiC) = n'{C\V) 



Tx'{dd) 



c 



j^L{ey{de) 



(3.8) 
(3.9) 



L{e)7r'{de) = K{k) / L((A;,v,s))dsrfv 

C JveA Js£B 



VGA JseB 



sf ^"\l - s,)^V^-^cisciv 



(3.10) 
(3.11) 



If, in particular, B is the rectangle [ai, 6i] x ■ ■ ■ x [a^, bk] we get: 



L{ey{de) = K{k) 



c 



dw 



(3.12) 



The inner integral is a Beta integral. Consider the special case in which B 
[0, 1]*^ (i.e. C = {k}x Ax [0, 1]^=). Then, 



/ L{e)7r'{de) = ti{k) I pk{^)dw 

Jc Jv£A 



(3.13) 
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Where Pfc(v) and also pk are defined for k > 2 by: 

' ^(v)! /V»(v)! 



Pk ■= / Pfc(v)rfv (3.15) 

"'ve[o,i]'=-i 

For /c = 1, V =0 and Ji = [0, 1] so tliat A'';j^ and Ni are tlie total number 
of heads and tails respectively. In this case define pi = = NllN'^l/lNl + 
N^ + l)l 

The posterior probability of the k'th model is readily computed: 

n'^iQk) = «:(A;)pfc/c (3.16) 

Where the normalizing constant c is the sum: Yl'jLi '^{j)Pj- 

Now is a good time to notice that for any k and v G [0, 1]^^^, and any data 
set V, Pkiy) and pk are both positive and < 1. Consequently, the same holds 
for c. 

Let \j denote Lebesgue measure on [0, 1]-'. In the special case that j = 0, 
let Ao denote counting measure on the set {()}. Then the posterior density of 
the change-points v with respect to Xk-i, given that model k holds is Pfc(v) / Pk- 
Informally: 

7r:(V ed^r\K = k) = Pk{^r)/pkXj{dw) (3.17) 



Finally, the posterior probability that S is in rectangle B = [ai, bi] x ■ ■ ■ x 
[ofc, bk], given that model k holds and that the change-points are given by v is 
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indeed the same as k independent Beta^s: 



<(S e B \ K = k,y = v) 



Pfc(v) 



1 




u 



'n^\l-uf^^^^du (3.18) 



k 



Yl P (5eta(iV/(v) + 1, A^°(v) + 1) G [aj, bj]) du 



(3.19) 



Consequently, if s'j denotes the expected value of Sj under the posterior, 
given that model k holds and that the change-points are given by v, then 'sj is 
just the mean of the Beta{Nj{v) + 1, A^°(v) + 1) distribution: 



This section sets up some basic definitions and ideas that underlie the algorithm 
described in the next section. The first definition, gives a new meaning to 
the symbol X which will be used throughout the remainder of this chapter. 
Elsewhere, X still stands for the covariate space. This should not introduce 
any confusion. 

Write for the parameter space formed from (/c, v) pairs with v G [0, 1]'"'"^ 
and build up the full parameter space X by taking the countable union: 



Again, Xi is a special case. Define xq '■= (1,0) ^"^^ l^t Xi = {xq}, so that Xi 
is a singleton. 

For convenience, let k{x) stand for the /c-part of x and let x stand for the 
v-part of X. Bear in mind the nuisance that for x G X^, x G [0, l]'^"^. Extend 



iV/(v) + l 



(3.20) 



iV;(v) + A^0(v) + 2 



3.3 Setup 



Xk■.= {{k,^r) : v G [0, l]'^-!} 



(3.21) 
(3.22) 
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this definition to sets; i.e. for a subset C C Xk, define C_a,s {x : x G C}. 

For future reference, endow X with the a-algebra B generated by sets of the 
form k X B where k > 2 and B is some (Borel) measurable subset of [0, l]'^"^ 
and by the singleton set Xi. Let r denote the extension of Lebesgue measure 
to X. That is, for k > 2 and B a (Borel) measurable subset of [0, l]'^"^ define 
the r measure of a set C = x i? c A^, as the k — 1-dimensional Lebesgue 
measure of C_ = B. To account for k = 1, let t{Xi) = 1. 

Finally, combining the results in Equation 3.16| and Equation 3.17[ the dis- 



tribution under vr^ of the random point X = {K, V) in X can be computed. In 
the present notation, X has the density (f){x) with respect to r: 

0(x) := K{k)pki^^)/c (3.23) 



3.4 An MCMC Algorithm 

This section contains a description of a (randomized) computational algorithm 
to produce a random sequence 6i, . . . , 9m drawn from the posterior distribution 
TT^. A more formal version of this algorithm will be developed in Isection 3.81 
Finally in Isection 3.101 it will be shown that the generated sequence has an 
ergodicity property. The main consequence of this is that if g is some measur- 
able function with J \g{9)\TT'^{d9) < oo, one can use the average value of g on 
the sampled values to approximate the integral, in the sense that: 

1 r 

9{e)<{d9) (3.24) 

j=i 

The algorithm exploits the observation made in Isection 3^ that under the 
posterior, conditioned on K = k, \ = \, and on the values of A'^^^ and A'^^, 
each Sj is an independent Beta{Nj + 1, A^^ + 1) random variable. Since Beta 
random variables are easy to simulate and work with analytically, it suffices 
to simulate {k, v) pairs from the (marginalized) posterior, instead of the full 



CHAPTER 3. COMPUTING THE POSTERIOR 



32 



9 = {k,^r,s). 

Essentially, the MCMC algorithms that are developed in this chapter are a 
minor variation of the usual birth-death MCMC approach that is often used to 
simulate point processes. This approach is described by Geyer and M0ller js^ . 
They claim geometric ergodicity for their Markov chain, under suitable restric- 
tions on the sampling density. One technical distinction from these approaches 
is that when simulating a point process, the fundamental object is a subset of 
points {vi, . . . , Vfc_i}; the theoretical treatment given in this chapter considers 
instead mathematical objects of the form (/c, (vi, . . . , v^^i)). 

As far as the computations are concerned, though, there is no distinction 
between these formally different objects. Both could be represented on the 
computer operationally as a simple list of numbers called x with each number 
in the list specifying the location of a particular change-point. The algorithm 
merely assumes that it can call a function that evaluates to the positive 
real number defined by [Equation 3.23 

In particular, one can use k{x) to find out that there are k — 1 elements in 
this list. Furthermore the computer has no problem removing elements from 
the list all the way down to the empty list, or (ideally) adding elements one-by- 
one indefinitely. To agree with the notation of the previous section, if k{x) = k, 
then for 1 < j < A; — 1, write Xj for the j'th element of the list. Write Xq for 
the empty list. 

Let M be the number of Monte Carlo samples that are desired. The algo- 
rithm simulates the workings of a Markov chain and generates the realizations: 
xi, X2, . ■ ■ , xm- It is assumed that for j from 1 to 5, pj is a positive number, 
and that these numbers sum to one. These represent the mixture probabilities 
with with 5 component Markov chains are combined. For my computations, 
through a mixture of intuition and trial-and-error, I chose p as in ITable 3.T1 
These values are by no means optimal. (The irregular numbers quoted here 
result from standardizing simpler ones so that they add to 1.) 
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Pi 


P2 


P3 


P4 


P5 


0.1429 


0.1429 


0.2381 


0.0476 


0.4762 



Table 3.1: MOM C Mixture Probabilities 



Main MCMC Algorithm 

Set X = xq. 

For i ranging from 1 to M, repeat the following steps: 

1. Pick J at random from 1 to 5 with the probabilities pi through 
respectively. 

2. Pick a "proposal" point y by following the subroutine specified in Ac- 
tion J (defined below). 

3. Calculate a = min(l, 

4. With probability a, set x = y. 

5. Set Xi —— X. 



Action 1: Add or Delete a Random Coordinate 
Set y = X. 
Flip a fair coin. 
If heads: 

Generate U uniformly at random on [0, 1]. 
Add U to the end of y. 
Return y. 
If tails: 
Set y = X. 
If y is empty: 

Return y. 
Otherwise: 

Permute y randomly. 

Delete the last entry from y. 

Return y. 
End if 
End if 
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Action 2: Randomize a Coordinate 

Set y = X. 
If y is empty: 

Return y. 
Otherwise: 

Permute y randomly. 

Generate U uniformly at random on [0, 1]. 
Replace the last coordinate of y with U. 
Return y. 
End if 



Action 3: Shift a Coordinate 

Set y = X. 
If y is empty: 

Return y. 
Otherwise: 

Permute y randomly. 

Set Ul = the last element of y. 

Set U2 = Ul+a, random normal with mean and standard deviation 0.1. 
If U2 is in [0,1]: 

Replace the last coordinate of y with U2. 
End if 
Return y. 
End if 



Action 4: Randomize All Coordinates 

Set y = X. 
If y is empty: 

Return y. 
Otherwise: 

For each coordinate of y do: 

Generate U uniformly at random on [0, 1]. 
Replace the current coordinate of y with U. 
End for 
Return y. 
End if 



CHAPTER 3. COMPUTING THE POSTERIOR 



35 



Action 5: Shift All of the Coordinates Very Slightly 
Set y = X. 
If y is empty: 

Return y. 
Otherwise: 

For each coordinate of y do: 

Set Ul = the current coordinate of y. 

Set U2 = Ul+a random normal with mean and standard deviation 0.01. 
If U2 is in [0, 1]: 

Replace the current coordinate of y with U2. 
Otherwise: 

Continue to the next coordinate. 
End if 
End for 
Return y. 
End if 



If desired, the resulting xi,...,xm can be randomly augmented into a 
sequence 9i,...,9m- To do so, simply generate all the necessary Beta ran- 
dom variables in order to sample S from its conditional distribution (c.f. 
Equation 3.18). 



3.5 Posterior Mean Calculation 

There are many potential uses of for the sample of 6 values that can be approx- 
imately drawn from the posterior using the algorithm in the previous section. 
For example, for each sampled 6, a plot of the corresponding fg can be made, 
and inspecting some of these can give some idea about how confident to be 
about the shape of the unknown regression function. 

This section, though, concentrates on estimating the mean of these func- 
tions. Call the resulting function the posterior mean, /. It represents the 
posterior's best estimate (in an sense) of the unknown regression function. 
More formally, define the value of / at a point u G [0, 1] as the expected value 
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of fe{u) under the posterior on 9: 



f{u) = I fe{x)K{d9) (3.25) 
sj{u;x)(p{x)T{dx) (3.26) 



X 



Where 'sj was defined by [Equation 3.20[ (f){x) was defined by [Equation 3.23 
and T was defined shortly before </>. 

The algorithm from Isection 3^ can be used to approximately generate a 
sample Xi, . . . , Xm from 0(x)r((ix) and then estimate this integral by: 



AI 

X) (3.27) 



1 ^ 
77 sj{uv. 



M 

i=l 

3.6 Metropolis-Hastings Markov Chains on Gen- 
eral Spaces 

Where does the algorithm described in Isection 3.41 come from? This section ad- 
dresses the MCMC approach and begins a description of the larger framework 
within which algorithms like this one can be derived and evaluated. 

Generally, MCMC techniques suggest how to formulate algorithms (more 
specifically Markov chains) that may be useful in order to sample a stationary 
ergodic sequence that converges to a given stationary distribution. This s ubj ect 
is very broad and active. For a review of the main ideas, see Tierney [6J| or 
Liu j50|. MCMC techniques have opened up to numerical investigation a wide 
variety of Bayesian procedures, especiall y w ith the advent of the Gibbs sampler, 
the Metropolis-Hastings algorithm jsi .]4^. and its extension to the problem 



of "model determination" through "reversible jump" MCMC (Green 41|). 

In very general terms (following 0), the Markov chain setup is as follows. 
Let TT be a probability distribution on a measurable space {X,B). Let P be a 
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transition probability function on this space, that is, P is a function on X x B 
such that, for each x E X, ■) is a probabihty measure on {X,B) and, 
for each C G i3, P{-,C) is a measurable function on {X,B). The Markov 
chain Xq, Xi, X2, ... is generated as follows. We fix a starting point Xq = xq, 
generate an observation Xi from P{Xq,-), generate an observation X2 from 
P{Xi, ■), and so on. 

P will be constructed to obey detailed balance with respect to n. Namely: 



This condition is very convenient because although it will be easy to construct 
chains that satisfy it, it is also powerful. In particular, (by choosing B = X) 
it implies that vr is an invariant measure for the Markov chain, that is. 



The goal is to choose P so that vr is the unique invariant measure; and, moreover 
that the Markov chain will produce an ergodic sequence of observations from 
vr. For this goal, detailed balance is a useful (although not necessary) "first 
step." 

Suppose P is some transition probability function which (presumably) does 
not satisfy reversibility with respect to vr. Let 'jl{dx,dy) := Ti{dx)P{x,dy). 
Suppose that Jl is absolutely continuous with respect to some symmetric cr- 
finite measure /i. Specifically, suppose that /i is a measure on the measurable 
space {X X X, a{B x B)) that satisfies j2{A xB) = j2{B x A) for all A,BeB. In 
simple cases, one can choose to be a product measure; for example Ti{dx)Ti{dy) 
or \{dx)\{dy) where perhaps vr has a density with respect to A. If no such 
measure is readily available, one may take n{dx, dy) = ^Jl(dx, dy) + \]l{dy, dx); 
i.e. fi{A X B) = |/u(/l X B) + \Ji{B x A) for A,B & B. Let p be a version 
of the Radon- Nikodym derivative of Ji with respect to /i so that //(dx, dy) = 




(3.28) 




(3.29) 
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p{x,y)fi{dx,dy). If p{x,y) happens to be symmetric; i.e. p{x,y) = p{y,x) for 
all x,y G X, then P aheady satisfies detailed balance with respect to tt. To 
verify this, compute that for every every A,B & B: 

P{x,B)7r{dx) 



A 



f f 

/ / Ti{dx)P{x,dy) 

1 x&A JyeB 


(3.30) 


/ P{x, y)fi{dx, dy) 

J {x,y)€AxB 


(3.31) 


/ P{y,x)fi{dx,dy) 

' {x,y)€AxB 


(3.32) 


/ p{x, y)fi{dx, dy) 

J {x,y)€BxA 


(3.33) 


[ [ n{dx)P{x,dy) 

JxeB JyeA 


(3.34) 


/ P{x,A)Tr{dx) 

IB 


(3.35) 



When p is not symmetric, there is no trouble in constructing the closely 
related symmetric function p{x,y) = mm(j){x,y),p{y,x)). Does this suggest 
how to construct a probability transition function P based on P that satisfies 
detailed balance with respect to vr? Yes, fortunately it does. Define: 

mm{l,p{y,x)/p{x,y)) iip{x,y)>0 
a{x,y) = { ^ (3.36) 

1 if p{x, y) = 

Then (check) p{x,y) = a{x,y)p{x,y) for all x,y ^ X. This suggests defining 
Q{x,dy) = a{x,y)P{x,dy), so that 7i{dx)Q{x,dy) = a{x,y)p{x,y)^{dx,dy) = 
p{x,y)^{dx,dy). The only problem is that Q{x,-) is not a probability (in 
general), but a sub-probability. To account for the "forgotten" mass, set h{x) = 
1 — Q{x, X) for all X & X . And define P{x, dy) = Q{x, dy) + h{x)6x{dy). Here, 
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Sx{-) stands for the Dirac measure at x. In expanded form, this defines: 



In conclusion, one can now easily verify that this defines a probability tran- 
sition function P which satisfies detailed balance with respect to vr and which 
is a simple modification of P. Indeed, P is simply a modification of P that 
sometimes "holds" instead of taking the transition that P proposes. Suppose 
that Y = y is a particular value drawn from P{x, ■). That is, suppose that P 
has "proposed" the transition from x to y. Then P "accepts" this transition 
with probability a{x,y), but holds with probability 1 — a{x,y). Furthermore, 
because a{x,y) only depends on the ratio p{y,x)/p{x,y) it is sufficient to be 
able to compute p{x, y) up to an unknown constant factor. 

Clearly, then, P is a computationally simple modification of P; the only 
caveat is that a{x, y) may often be very small or even so that in the extreme 
degenerate case in which a = 0, P is the Markov chain that always holds. 
Indeed, this Markov chain is reversible with respect to any distribution, but 
it certainly does not serve the larger goal of producing an ergodic sequence of 
realizations from vr. Similar problems can occur if P is not transitive or is oth- 
erwise unsuitable. For these reasons the ergodicity conditions from lsection 3.91 
are needed. 



A simple example is in order to make these ideas more concrete. This chain 
will not be as efficient (in practice) as the local-move Markov chain developed 
in the next section. 

In words, this will be the chain that stays fixed (holds) at its current value 
X ^ X until a new value yX drawn from the prior is accepted; y will always 
be accepted if y makes the data more likely (i.e. higher predictive probability 



P(x, dy) = a{x, y)P{x, dy) + 




{1 — a{x,z))P{x,dz) Sx{dy) 



(3.37) 



3.7 A Simple Markov Chain 
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under our model); otherwise it is accepted with a probabihty reflecting the ratio 
of the predictive probabilities. In this way, the chain readily walks "uphill," 
but, with just the right probability (because of the detailed balance condition 
that will be shown) it also walks "downhill." 

Recall the notation from Isection that defined the measurable space 
{X,B) and denote the posterior distribution on this space by vr^. For con- 
venience, recall that for any x & X , x = {k,\) and write k(x) = K{k{x)) 
and p{x) = p(v(x)) so that the prior vr on points y & X has density K,{y) 
with respect to r. For any x & X and any B & B, let P be defined by 
P{x,B) = J^^^ K{y)T{dy). That is, P is the probability transition function 
that (without reference to x) samples a new point y & X from the prior. Now 
expand '7r'„{dx)P{x,dy) = (t){x)K{y)T{dx)r{dy) = {p{x) / c)K{x)K{y)T{dx)T{dy). 
Conveniently then, this distribution has a density with respect to the prod- 
uct measure K{x)T{dx)n{y)T{dy). To agree with the notation in the previous 
section, let p denote this product measure and let p{x,y) = p{x)/c. Recall 
that each of these terms is always positive. Because of this and a convenient 
cancellation, the expression for a becomes 

a{x,y) = min(l,p(y)/p(x)) 

Finally, as before, define Q{x,dy) = a{x,y)P{x,dy), h{x) = 1 — Q{x,X), and 
set P(x, dy) = Q{x, dy) + h{x)6x{dy). 

3.8 A Local-Move Markov Chain 

This section gives a formal definition of the Markov chain type algorithm that 
was explained in Isection 3.4[ It then shows that this chain satisfies detailed 
balance with respect to tt^. To introduce this more useful, but more compli- 
cated Markov chain on {X,B), some notation is needed. When j > 2, and 
V G [0, 1]-', write v_ for the vector (vi, . . . , Vj_i) G [0, l]-'"^ which leaves off the 
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last coordinate of v. Consider some point x = {k,\) G X. Let gi{x) stand for 
the point in X that is "one level down" from x with the last coordinate of v 
having been removed. That is, when A; > 3 and v = (vi, . . . , Vfc_i) G [0, 1]'^"^, 
let gi{x) = {k - l,v„). For x = (2, (vi)) G X2, let g^ix) = (1,0). For 
X = (1,0) £ '^i) there is no further down to go, and so let gi{x) = x. Let 
A^(-) = ^g|(x)(-) denote the Dirac measure at gi{x). 

Similarly, let g^{x,u) stand for the point in X that is "one level up" from 
X, where the last coordinate is filled in with u. That is, for k > 2 and v = 
(vi, . . . , Vfc„i), g^{x,u) = (/c+l, (vi, . . . , Vfc_i,M)). For X = (1, 0), let g^{x,u) = 
(2, (n)). Let A|(-) denote the distribution of g^{x,U) where U is uniformly 
distributed on [0, 1]. 

Let Aj(-) denote Lebesgue measure on W and for i? G i3 where B G X^, let 
= {v G [0,l]^'-i : (A;,v) G B}. 

To define the transition probability function P on {X, B) satisfying detailed 
balance with respect to tt^, I first define various transition probability functions 
Pj{x,dy); then, for a generic function aj{x,y), define: 

Qj{x, dy) = aj{x, dy)Pj{x, dy) (3.38) 

Next the a/s are chosen so that for every j, Qj satisfies the appropriate detailed 
balance formula: 

/ Tx{dx) / Qj{x,dy)= / 7r{dx) / Qj{x,dy) (3.39) 

JxgA Jy&B Jx&B Jy&A 

It is easily verified then that Pj = Qj{x,dy) + [1 — Qj{x, X)]6x{dy), is a 
transition probability which satisfies detailed balance. Finally, set:P(x, dy) = 
T.jPjPj{x,dy). 

Generally the proposals I consider are "symmetric" and so aj {x, y) will work 
out to be min(l, 0(t/)/0(x)) in each case. An exception is in [Equation 3.86 
but when the proposal density is simple, (as it is in the case of interest) it also 
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reduces to the previous case. 

To begin, set Pi{x,dy) = jX^{dy) + ^X^{dy). This transition probabihty 
function represents the chain that adds or deletes coordinates from the vector 
V randomly. 

To choose ai, first compute the left and right hand sides of the detailed 
balance equation for Qi. Let A,B & B with A C and B C Xk. There are 
three cases of interest for j and k: (1) k = j + 1, j > 2, (2) j = 1, k = 2, (3) 
j = l,k = 1. These account for all the possibilities because if j < k, simply 
replace the roles of j and k; if k > j + 1 or j = k ^ 1 both sides will evaluate 
to 0. If ai can be chosen to set the left and right sides equal for every such 
case, detailed balance is proven because general A,B E B can be decomposed 
into these component subsets. 

Suppose that k = j + 1, and calculate: 



7r{dx) / Qi{x,dy) (3.40) 

<j>{x)a,{x,y)T{dx)[lx^^{dy) + ^A^(d|/)] (3.41) 

]-4>{x)ax{x,y)T{dx)\'{{dy) (3.42) 

x^A JyeB 2 

^«^(j)Pi(v)ai((j, v), ^t((j, \),u))Xj^i{d\)du (3.43) 

VGA Ju£C{v;B) ^ 

^'^(j)Pi(w-)ai((j, w_), ^|((j, w_), u))Xj{dw) (3.44) 



xeD 2 



7;<Pi9i{x))(^ii9lix),x)T{dx) (3.45) 



Where: 



C{^r■B) = {ue [0,1] : g^{{J,^r),u) e B} 



(3.46) 
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D = {we[0,lY ■ e A,Wj e C{w^;B)} (3.47) 

= {w eB : w_ e A} (3.48) 

D = {x = {k,w) e Ck : w eD} (3.49) 

= {xeB : g^{x) G A} (3.50) 

Similarly, compute: 

[ Tx{dx) I Qi{x,dy) (3.51) 

J x&B Jy<^A 

= [ [ <P{x)a,{x,y)T{dx)[lx^dy) + lxi{dy)] (3.52) 

= [ [ h{x)a,{x,y)T{dx)Xl{dy) (3.53) 

JxGB Jy&A ^ 

^ L B l'^^^^^^^^' 9i{x))r{dx)lg^(^^^^ ^ ^ (3.54) 

= / 7:4>{x)ai{x,gi{x))T{dx) (3.55) 



Suppose that j = 1, A; = 2, so that A = {xq} or yl = where xq = {1, 0) and 
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any x G -B is of the form (2, (vi)). Then: 



Tr{dx) / Qi{x,dy) 

x£A Jy€B 

0(x)ai(x,y)r(rfx)[^Af(rf?/) + ^Af(c/?/)] 

1 



(l){x)ai{x, y)T{dx)X\{dy) 



1 

yes 

f 



2^xo G A<^(^o)ai(a;o,2/)Af (dy) 
^^xo G A<^(^o)ai(a;o, (2, vi))dvi 



^0(xo)ai(xo, x)r(c/x) 



Similarly, compute: 



7r{dx) / Qi{x,dy) 

xG-B >^3/GA 

(/>(x)ai(x,?/)r(dx)[^Af(rfy) + ^A];(dy)] 

xG-B JyeA ^ ^ 



1 

xG-B Jy^A 



y)T{dx)\\{dy) 



^4>{x)ax{x,g^{x))r{dx)\^^i^^^ ^ ^ 
zgd 2 

The only remaining case to compute is where j = k = 1 and the only case 
of interest here is the one in which A = B = {xq}. Even this is trivial, because 
the left and right sides of this symmetric case must surely match. 

Recalling that > for all x & X, for x,y & X detailed balance is 
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achieved upon defining: 

ai(x, y) = min(0(y)/0(x), 1) (3.67) 

To define P2 additional notation is needed. For A; > 2, 1 < j < A; — 1 let 
Ai^(-) denote the distribution on {X, B) under which the j'th coordinate of the 
V in X = (/c, v) is replaced with a uniform random variable. For k > 2 and 
X G Xk, let \^{dy) = -^Yl'jZi ^^j{dy), i.e. the probability distribution that 
picks a coordinate of the v in x randomly and then changes it to a uniformly 
random value. For x = xq, let X^{dy) = 6xo{dy). 

Now define P2{x,dy) = \^{dy). More generally, define: 

dy) = ^(x, y)\%dy) + h^{x)5^{dy) (3.68) 

Where \l/ is any (measurable) non-negative function on X x X satisfying for 
all x,y e X: (1) ^(x,y) = *(y,x), (2) j^^m{x,y)X-{dy) + h^{x) = 1 for 
some non-negative hq,{x). For example, if x = (/c, v^,), y = (/c, Vy), and 00- 
represents the univariate normal density with standard deviation a > 0, take 
\E'(x, y) = 0cr(| |va;— VyI I2) and /;.^(x) = 1 — /^^ '^{x,y)X^{dy). Fortunately, there 
will be no need to compute h<i,] it is enough to note that it is non-negative. 

To check the detailed balance for Qs it is sufficient to consider A, B E B 
where A, B G Xk- 
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/ vr(rfx) / Qsix^dy) (3.69) 
<P{x)a^{x,y)T{dx)[^{x,y)X%dy) + h^{x)5.,{dy)] (3.70) 



0(x)a3(x, ?/)*(x, y)T{dx)\':{dy) 
/ (f){x)a3{x, x)hq,{x)lrf, g j^T[dx) 

Jx&A 

(t){x)a^{x, y)^(x, y)T{dx)\':{dy) 

x^A JyGB 

+ / 0(x)a3(x, x)/i^(x)r((ix) 



(3.71) 



(3.72) 



This second term in the latter expression does not change if we replace A, B 
with 5, A and consequently needs no further consideration. To expand the first 
term, recall that in this situation A^^ just picks one of the k — 1 coordinates 
and randomizes it. Let >S'^(w) stand for the modified version of w in which the 
j'th and fc'th coordinates of w are swapped. In suggestive notation, let: 

X = X{w) = {k,w.) (3.73) 
Y = Y{w)= X(5,^(w)) (3.74) 



Dj = {wE [0, l]'^ : X(w) e A, r(w) e B} (3.75) 



D'j = {we [0, 1]'= : X{w) e B, Y{w) G A} (3.76) 

Notice that w e Dj S^{w) e D'j. 

Then, suppressing the w dependence from X and Y, the first term expands 
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^ k-l 

k 



^ V / <j){X)a,{X,Y)<i/{X,Y)X,{dw) (3.77) 

k—1 

-i-V/ (P{Y)as{Y,X)^{Y,X)Xk{dw) (3.78) 

77^ E / <PiY)as{Y,X)^{XX)Xkidw). (3.79) 



In summary then, (using Equation 3.79) 



7r{dx) / Qsix^dy) (3.80) 

fe— 1 

= -l-V [ <P{Y)a;iY,X)<iJ{X,Y)\k{dw) + const (3.81) 

^ - 1 iwGD; 



While, using Equation 3.77 with the roles of A and B interchanged so that 



TT{dx) / g3(a;,rf?/) (3.82) 
x€B Jy&A 

k—1 

= -^V/" (l)(X)a3(X,Y)^(X,Y)\k(dw) + const (3.83) 

Again we achieve balance using: 

asix, y) = mm{<p{y)/(j){x), 1) (3.84) 

Define P^i^x, dy) = ix{y)'T{dy). Where for any x G A", ^xi.-) is a density with 
respect to r. Accordingly: 

'K'^{dx)Pi{x, dy) = (f){x)^x{y)r{dx)T{dy) = p{x, y)T{dx)T{dy) (3.85) 
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By the arguments in lsection 3.61 the corresponding transition probabihty func- 
tion P4 satisfies detailed balance with respect to ir'^ provided that: 




For example, for x G A^, choose ^x{y) = ^ ^^^^ 
familiar choice ai{x,y) = min(l, 0(i/)/</)(x)) works equally well. 

Finally, define P5{x, dy) for x G Af^ as the composition of the k — 1 separate 
transition probability functions P5 applied sequentially from j = 1 to j = A; — 1. 
Each Pi is just intended to move the j'th coordinate of v^^ by a small amount 
(or hold). In effect, then, P5 moves all of the coordinates of x randomly, but 
technically speaking some subset of them may hold on any given step. For 
X E Xk and 1 < j < — 1 set Pl{x,dy) = '^{x,y)\^j{dy) + hif,{x)6x{dy) for 
some \E' and satisfying the same constraints as made when defining P3. I 
omit the verification that by choosing = min(l, as usual, P5 

satisfies detailed balance. 

This concludes a consideration of each chain Pi through P5. Each was 
shown to satisfy detailed balance with respect to tt^. Consequently, their mix- 
ture P also satisfies detailed balance with respect to vr^. 

As an aside, notice that one may compose any of the Pj with a random 
permutation since itself is invariant with respect to permuted v. Doing so 
will allow Pi to add and (by pre-composing) delete coordinates in arbitrary 
locations. 



3.9 Markov Chain Convergence Theory 

Early results on the ergodicity of Markov chains on general state spaces used 
a condition known as the Doeblin condition. It implies that there exists an 
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invariant probability measure to which the Markov chain converges at a geo- 
metric rate, from any starting point. ^ 



Theorem 1 (Doob (1953) (28|]). Suppose that the Markov chain on the 
measure space {X, B) generated by probability transition function P satisfies 
the Doeblin condition that there is a probability measure A on {X, B), an integer 
k, and an e > such that: 

P^(x, C) > eX{C) for all x e X and all C e B 

Then there exists a unique invariant probability measure tt such that for all n, 

sup|P"(x,C) -7r(C7)| < (1 - e)("/'=)-^ for all x e X 

An easy corollary is that if P satisfies the conditions of the theorem, and 
was already known to be reversible with respect to some specific distribution, 
then that same distribution must be the unique stationary distribution vr. The 
Doeblin condition is quite strong and rarely holds in applications. 

Athreya, Doss, and Sethuraman ^ prove an ergodicity result for general 
state spaces whose conditions hold much more broadly and remain reasonably 
easy to check. An abbreviated version is given below 

Theorem 2 (Athreya, Doss, and Sethuraman (1996)). Suppose that the 
Markov chain {Xn} with transition function P{x,C) has an invariant probabil- 
ity measure vr, that is \Equation 3. 29^ holds. Suppose that there is a set A eB, 
a probability measure A with \{A) = 1, a constant e > 0, and an integer % > 1 
such that: 

7r({x : P,(X„ G A for some m > 1) > 0}) = 1 (3.87) 



^This section reviews some material given in a paper by Athreya, Doss, and Sethura- 
man Pi 
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and 

P"o(x, ■) > eA(-) for each xeA (3.88) 
Further suppose that either no = 1 or that 

g.c.d.jm : there is an > such that P"''(x, ■) > emA(-) for each x G A} = 1 

(3.89) 

Then there is a set D E B such that 

7r{D) = 1 and sup |P"(x,C) - 7r(C)| ^ for each xeD. (3.90) 

Let f{x) be a measurable function on {X,B) such that f 'n'{dy)\f{y)\ < oo. 
Then 

P. ^ j ^(^^y)/(y)| = 1 for [n]-almost all x (3.91) 

3.10 Convergence Results 

A short proof suffices to show that theorem |21 apphes to the local-move Markov 
chain. It is assumed that mixing probability pi > 0. The proof takes advantage 
of the atom xq = (l, 0. 

Let P denote the local- move Markov chain from lsection 3.81 It was already 
verified there that P satisfies detailed balance with respect to vr^, and it follows 
that P has invariant probability measure vr^. Let A = Xi, the singleton set 
{xo}. Let A(-) be counting measure on Xi. Let no = 1. Set e = P(xo,A'i), 
i.e. the chance of holding at this atom. This quantity is positive (because 
P{.Xq, Xi) > PiPi(xo, {xo}) > PiPi(xo, {xo}) > 0). 

This verifies all of the conditions of theorem |21 except condition l3.87l It suf- 
fices to show that for any A; G N and any x E X^, Px{Xm G A for some m > 1) > 
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0. It has already been shown that for any x & Xi, P^^Xm G A for some m > 1) > 
0, since Xi is the singleton set considered earlier and because this quantity is 
greater than P(xo,A), which was positive. Suppose, for induction, that for 
any k < ko and any x ^ X^, Px{Xm € A for some m > 1) > 0. Consider any 

P{x,Xk,)=PiPi{x,Xk,) (3.92) 

y)X'^{dy) + y)\l{dy) + hi{x)5^{dy) 





(3.93) 


\pi / (^i{.x,y)Xl{dy) 


(3.94) 


\pi I (^i{x,y)^g^{x){dy) 


(3.95) 


]^Piai{x,gi{x)) 


(3.96) 


]^Pi min(0(^i(x))/0(x),l) 


(3.97) 



Now, both </)(x) and (f){gi{x)) are positive and finite, so their ratio is as well, 
and so P(x, X^^) > 0. This proves that for any x G Af,to+i, 

P^iXm e A for some m > 1) > (3.98) 

All the conditions are now verified. 

The same argument goes through without modification for the simple Markov 
chain from Isect ion 3.71 

In summary the preceding sections have proven the following theorem: 

Theorem 3. Let {X,B) be the measurable space defined in \section 3.'^ Let P 
be the probability transistion function on this space that is defined as a mixture 
of the component probability transistion functions Pi, through P^, explained in 
\section 3.^ with mixture weights pi, through p^ positive and summing to 1. 
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(This Markov chain is a formal version of the Algorithm given in section ^^\ ) 



Then, P satisifies detailed balance with respect to the distribution tt^ defined 
by \Equation 3.23[ Furthermore, P satisfies the conditions of Theorem and 
therefore, n'^ is the unique invariant distribution of P. Indeed, there is a set 
D E B such that 

<(L>) = 1 and sup | P"(x, C) - 7r^(C)| ^ /or eac/i x G (3.99) 

Let f{x) be a measurable function on {X,B) such that J 'n'oidy)\f{y)\ < oo. 
Then 



{dy)f{y) }■ = 1 for [n'^]- almost all x (3.100) 



Chapter 4 
Examples 



This chapter describes the results of a variety of simulation experiments that 
I have conducted to better understand and evaluate the performance of the 
posterior mean estimates based on the prior vr. In all of these experiments, 
except where specifically noted, the prior k on the number of steps K is taken 
to be Geometrici^). Other Geometric priors are considered in lsection 4.41 and 
Isection 4.71 A few examples in [chapter 6 consider Poisson priors. There, they 
are discussed in relation to the convergence theory proven in [chapter "5) Most 
of this chapter, however, concerns evaluating how efficient the Geometric{\) 
posterior mean estimate is by comparing it with a wide variety of competing es- 
timation procedures. The first section establishes the standard format for these 
experiments. It compares the posterior mean estimate with CART and bagged 
CART estimates and interprets the results. Other methods are considered and 
compared in Isection 4!2l namely: a Lasso example that is connected with bag- 
ging, three smoothers, and some wavelet-based estimates. Finally, the dyadic 
Diaconis and Freedman binary regression prior is compared in lsection 4.31 Sec- 
tion [33| takes a step back to analyze the interaction between the data and the 
model by inspecting how the predictive probability chang function of 

where splits are placed. The final sections experiment with smaller and larger 
data set size, and also evaluate the performance of the posterior mean on a 
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more challenging regression problem. 

4.1 Comparison with CART and Bagged CART 

Since vr puts a prior on piecewise constant regression functions, it is natural to 
ask how its performance compares with conventional estimators that employ 
piecewise constant approximations. Of particular interest is the Classification 
and Regression Tree (CART) algorithm 0], and the closely related bagged 
CART algorithm. These methods are briefly reviewed in the next two sections. 
The impatient reader should skip ahead to lsubsection 4.lT3l where the methods 
are compared with the posterior mean estimate on simulated data sets. I have 
not "filtered" the experimental data at all: these are the originally simulated 
data sets in their original order. Technically, there has been a certain amount 
of filtering in the results because I did try using a variety of settings for the 
CART and bagged CART methods that I do not discuss. The choices that are 
presented are among the more standard and better performing possibilities. 
As far as the posterior mean results, these are not filtered at all, except that 
arguably I would not have a thesis if the results were not interesting. 

4.1.1 CART Review 

The CART algorithm prescribes how to select a "tree" that represents a good 
estimate of the unknown regression function (or classification rule). The "tree" 
terminology connotes the fact that the covariate space X is recursively parti- 
tioned with each piece assigned its own regression value: this recursive partition 
can be naturally associated (by inclusion) with a graph-theoretic tree. For com- 
pleteness, a brief description of the CART algorithm is in order. The reader 
should bear in mind that CART and related algorithms have been in use for 
many years now and so there are a variety of possible tweaks and alternatives 
that I do not discuss. The CART algorithm also has important advantages that 
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the following discussion will not address. Its regression estimate, for example, 
is unaffected if individual coordinates of the covariate vector are rescaled, so 
that the estimate does not depend upon the units of measurement. It is capa- 
ble of dealing with large data sets because of its efficient implementation. It is 
also very easy to interpret (although this is a risky business since the estimates 
can sometimes change dramatically with new data). 

Suppose for simplicity that the covariate space X is M'^ (specifically, the 
case A" = is of interest at present). In its basic form, the CART algorithm 
uses coordinate-aligned splits. That is, if a certain subset A of A" is being 
partitioned, it will be partitioned into the two subsets {x & A : xj < c} 
and {x E A : Xj > c} for some choice of j G {1, . . . ,d} and c G M. CART 
proceeds to construct a partition of A" in a greedy manner. That is, it begins 
by finding the binary partition of X that maximizes a certain splitting criterion 
and (proceeding recursively) all subsequent splits are subordinate to this one. 
Ultimately, the splitting criterion is chosen by the user, and I have chosen to 
use the ANOVA criterion. To explain this criterion consider that at any given 
stage of partitioning there is a certain class of possible real-valued functions 
that are constant on each partition element. Among this class, the function 
that minimizes the mean of squared residuals to the response data yi, . . . , y„ 
(MSE) is clearly the one whose value on any given element of the partition 
is the mean of the response values whose covariates "hit" this element. The 
split that is considered best is the split that results in the greatest possible 
reduction in this measure of residual error. Having chosen a split, the CART 
algorithm continues, recursively, to split the resulting subsets. Implicitly, it is 
building up a "tree" of subsets at each stage with X forming the root. The 
recursion terminates whenever there is only a single data point in the current 
partition element. Call the resulting binary tree of subsets the "full" tree. 
CART then proceeds to "prune" this tree. This operation depends critically 
upon a complexity parameter cp, which must itself be chosen. Typically cp 
is chosen by cross-validation with the restriction that it not be smaller than 
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some user specified value. For these experiments I choose cp using 10-fold 
cross-validation and the one-standard-deviation selection rule. This means 
that if the best achievable cross-validated measure of error (searching over all 
possible values of cp) is xerr and the sample standard deviation of xerr is xstd, 
the selected value of cp will be the largest value whose xerr does not exceed 
xerr+xstd. To prune the full tree using parameter cp, each pair of leaves is 
considered in turn; if the pair does not improve the splitting criterion by at 
least the value cp, it is removed. Ultimately, every pair of leaves might be 
removed, resulting in the tree consisting only of the root node. Finally, the 
pruned tree corresponds to the estimated regression function. It is constant on 
each element of the (pruned) partition and its value on a given element is the 
mean of the response values there. 

4.1.2 Bagging Review and Discussion 

A comparison with the bagging procedure 0| is also relevant. Bagging 0] is 
a meta-algorithm that can (hypothetically) be applied to any existing clas- 
sification or regression technique in an effort to improve them. This thesis 
focuses on bagged CART. Essentially, the bagging idea is just to take many 
bootstrap resamples of the data set, apply some existing technique to each 
resampled data set, and then take an average of all of the resulting regression 
estimates. For completeness, to form a bootstrap resample of a dataset with n 
items Zi = {xi,yi): (1) Independently, choose n integers Ni, . . . ,Nn uniformly 
at random from 1 to n. (2) Form the new data set: zj^j-^, zj^j^, . . . , -Zat^. 



It is sometimes stated j43| that bagging is approximately a non-parametric 
Bayesian procedure, but I think that this is a misleading claim. Bagging, or 
more specifically bootstrapping, approximates the behavior of a Bayesian who 
has a (limiting) Dirichlet prior (as in Rubin's Bayesian bootstrap [57[). This is 
not really a prior for several reasons. Besides the fact that this limiting prior 
is improper, the "prior" depends on the data. This is not just in a partial 
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sense (such as the prior I use in 



[chapter 1 in which the "prior" depends on the 



covariates but not the response) or even in the manner of empirical Bayes pro- 
cedures which choose some parameters of the "prior" by looking at the data. 
Indeed, this prior specifies the law of all possible data relative to the empiri- 
cal distribution. This might make sense if the data were multinomial, but for 
data on a continuous space it is quite problematic. If taken literally it speci- 
fies that all future data will consist of elements drawn from the current data 
set. Notably, for classification, this means that unless the data set happens to 
contain a head and a tail for each case, then the predictive distribution that 
the bootstrap prior corresponds to excludes the possibility that the missing 
flip will ever occur. Indeed, this bootstrap prior can never directly say any- 
thing about future datapoints whose covariates are new. Furthermore, there 
is no meaningful model involved. For example, the prior tt implicitly builds in 
information that says that if a certain region has more heads than tails, then 
future points in and around this region probably will as well. The bootstrap 
prior says nothing like that explicitly; if it says that in effect it is only because 
of the fitting method that is forced upon it. 

Finally, there is nothing Bayesian about using CART, so how can bagged 
CART be a non-parametric Bayesian procedure? Why would a Bayesian who 
believed in a bootstrap prior use CART or neural nets or whatever when he 
could easily compute his own (very bizarre) posterior and get conclusions di- 
rectly? Arguably, he might do so in order to get new information to guide his 
decision because he has observed that CART (say) has worked well on other 
problems so it probably will work on this one as well. In this sense, bagging 
could be said to model the behavior of a Bayesian who had a limiting Dirichlet 
prior (that magically was supported on the data set itself), who then com- 
puted the posterior of his "prior," and who then goes to seek the opinion of 
an "expert." Cleverly, he does so, not only for the dataset actually received, 
but for a multitude of datasets of size n that are about equally likely under 
his posterior. In this way, he finds out what CART would think in a variety 
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of situations that he subjectively considers as possibihties. So far so good. 
Now he ought to weight these opinions according to how credible they seem in 
light of the data and his prior opinion about when CART works and when it 
doesn't. Instead, he now effectively forgets his (Dirichlet) posterior and puts a 
fiat prior on the CART regression results themselves. With his "new/y found 
prior" he calculates the decision that minimizes squared error loss, the mean. 
So, in summary, bagging approximates the behavior of a forgetful Bayesian 
who looks at the data first, then formulates his "prior" and posterior, then ig- 
nores them, except to ask CART what it's opinion would be in the cases that 
he thinks are likely, then promptly forgets the data altogether as well as any 
priors or posteriors of his own that he might have held (recently), so he makes 
up a new uniform prior on all the results that he got back from CART; finally 
he computes the mean according to his latest prior and reports his "findings." 

In any case, however dubious as a "Bayesian" procedure, bagging works. 
One could say that this is because it reduces modeling bias or because it elimi- 
nates certain instabilities in CART: both of these arguments make perfect sense 
to Bayesians and frequentists alike. It is clear, however, that it's not always a 
good idea, especially when the procedure already has low "instability." In this 
case, bagging mostly adds noise and reduces the effective size of the dataset 
somewhat. 

I also found one example where bagging was disastrous. Following the 
bagging procedure strictly, I fed bootstrap resampled data sets into CART, 
but the result was a mess of indecipherable noise with splits everywhere. This 
was true even though the CART procedure gave a reasonable estimate on the 
original data set. Why? Because in the CART step I used cross-validation 
and this is problematic because CART is working on a bootstrapped sample. 
Certain repeated data points wind up in both the test and training sets. As a 
consequence the CART algorithm has less data that it thinks and also thinks 
that is not over-fitting when it uses a model with too many splits; the pruning 
procedures became completely ineffective. 
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Some authors argue that this problem is easy to avoid by simply not both- 
ering to cross-validate within CART, and instead using the full trees. This 
may be true in some cases, or if one implicitly uses an effective default prun- 
ing rule (and not actually full trees), but it failed on my example. To correct 
the cross validation, some authors "tell" CART which points are repeated so 
that the whole case is either left in or left out. Instead, for my experiments, 
I simply used a hand-tuned complexity parameter lower bound. For the dis- 
astrous experiment, the lower bound was set to 0; when increased to 0.005 
the estimates were still quite rough, but usable. For the experiments that I 
present in the next section, I set to the lower bound to 0.01, which (admit- 
tedly) is the default for the RPART implementation of CART. Still, unless 
this default is universally good, this leaves a tuning parameter to be set, and 
I shudder to think about the bizarre computations involved in choosing it by 
cross-validating bagged CART. 

For the experiments involving bagged CART, I used 100 bootstrap resam- 
ples. This is a larger number of bootstrap resamples than is generally con- 
sidered necessary for bagging. By informal Rao-Blackwell-type reasoning one 
would think that this only serves to reduce Monte Carlo error. 

4.1.3 Comparative Simulation Experiment 

This experiment involves 10 simulated data sets each containing 1024 data 
points that were created in the manner explained in the introductory chapter. 
In this and in future sections these will be referred to as experimental runs 
1-10. Briefly, one simply chooses 1024 random uniforms for the x-values, and 
then flips 1024 independent coins with the success probability of the coin being 
given by the function /q that is indicated by the blue curve in figures 14.1. at 
and l4.1.bl These figures also include a red and a green histogram (drawn upside 
down). These are histograms of the of the x- values for which the coin came out 
heads or tails respectively. There are 75 bins in each histogram, and to keep 
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Figure 4.1. a: Simulation Experiment: Run 1 The CART, Bagged CART, 
and posterior mean estimates are compared with the true regression curve on a 
simulated data set (a = |, n = 1024). CART misses the left hand change-point 
entirely on this example; bagged CART seems to edge out the posterior mean 
near 0.75 but they are quite similar over all. 

Key: True / (blue). Posterior Mean (magenta), CART (black). Bagged CART 
(cyan) 
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Figure 4.1.b: Simulation Experiment: Run 2 The CART, Bagged CART, 
and posterior mean estimates are compared with the true regression curve on 
a second (equivalently) simulated data set (a = = 1024). The posterior 
mean and bagged CART estimates track each other quite closely again. 
Key: True / (blue). Posterior Mean (magenta), CART (black). Bagged CART 
(cyan) 
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track of the counts, yellow lines are drawn for every 5 events. In these figures, 
the posterior mean estimate (a = |) is in magenta, the CART estimate is in 
black, and the bagged CART estimate is in cyan. The first two experiments are 



shown in a large format, and the latter 8 are grouped together in Figure 4.1.e 
To be specific, in these experiments to compute the CART and bagged 
CART estimates, I used the RPART library in S-Plus with default control set- 
tings, 10-fold cross validation, the 1-standard-deviation rule to control pruning, 
and the ANOVA method. I apply the RPART regression algorithms to binary- 
classification data by assigning the values 1.0 and 0.0 to the two categories. 
The complexity parameter was restricted to be at least 0.01 (this is the default 
behavior) . 

4.1.4 Observations 

Here is a summary of notable features in the results of the first two experiments. 
Some of these features are explained by the subsequent discussion. In exper- 
imental run 1 ( Figure 4.1. a] ), the CART estimate only has 4 steps, "leaving 



out" an important split on the left side. In experimental run 2 (Figure 4.1.b), 



CART has 5 splits in about the right places. In both figures, of course, CART 
retains its jagged appearance, but it does do an admirable job of finding the 
locations where the true curve has change-points. All three estimates share 
some features with CART, they all make a fairly abrupt change at essentially 
the same place, somewhere near the location of the true change at |. They 
all have substantially more trouble with the change at | and this makes sense 
because it is much easier to detect the difference between two coins with suc- 
cess probabilities 0.8 and 0.4 than between two coins with probabilities 0.6 and 
0.4. Surprisingly, at least in experimental run 1, CART seems to be a bit more 
similar to the posterior mean estimate than to its own bagged version. 

Comparing the posterior mean with bagged CART in the first figure, notice 
that bagged CART smoothes out some steps that the posterior mean leaves 
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in. Notably this happens near 0.75 where bagged CART comes closer to the 
truth. For some reason this does not happen near 0.8 where bagged CART 
is more blocky than the posterior mean. Bagged CART's smoothing was a 
disadvantage near |, though. Here the posterior mean makes a much sharper 
transition and also has an extra "blip" up in the correct direction. Looking at 
the plateaus, bagged CART has been pulled closer to ^ (away from the truth) 
than the posterior mean. This is probably due to averaging in a large number 
of CART trees that omit the left hand split. 

Comparing the posterior mean with bagged CART in the second figure, 
some of the features have remained, but not all. In experimental run 2, the 
posterior mean takes a much smoother descent on the right than it did before. 
In this case, the posterior mean is closer to the truth over all. In the first, 
bagged CART seemed to have an edge. The posterior mean has a blip near 0.1 
that bagged CART does not. 

4.1.5 Some Explanations 

Overall, though, in both experimental runs, the bagged CART estimate and 
the posterior mean estimate seem quite similar. Consequently, despite the 
very different way in which the estimates are arrived at, on this example at 
least the computations achieve a similar result. This is remarkable, especially 
considering that CART and bagged CART are the result of years of careful 
problem specific work and tweaking. The posterior mean estimate represent 
an enormous amount of work too, but most of the work is of a general nature 
(e.g. MCMC techniques) and not problem specific. Moreover, this prior is 
very naive (by design) there are many ways to modify it that would improve 
performance on this example by problem specific tweaking. For example, one 
could allow both linear regions and constant regions, or impose a suitable (but 
not too restrictive) dependency among the success probabilities that would help 
smooth the right side. It may also make sense to space out the locations of the 
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Figure 4.1.c: The Bagged Posterior Mean Estimate: Bagging the posterior 
mean estimate on the data from experimental run 1 produces the red curve 
which does not seem to change the posterior mean estimate much except to 
add in some irregular wiggles. 

Key: True / (blue), Bagged Posterior Mean (red), Posterior Mean (magenta). 
Bagged CART (cyan) Posterior Mean on a Fifteen Bootstrap Resamples (gray) 



splits explicitly. Rather, this prior is very flat. The choice of Geometric{\) as 
the hierarchy prior is not the result of years of experience, but is, in fact, the 
first thing I tried. The prior vr cannot yet be recommended in general, but that 
it even performs modestly well "out of the box" on this example is an excellent 
defense of the Bayesian approach. 

Why, is it, though, that the results are so similar? If the results were 
identical one could hope to prove a theorem about why this was so, but since 
they are only similar, and since cross-validated CART is not easily amenable 
to mathematical analysis (much less its bagged variant), I can only speculate 
that they are similar because they both average together roughly the same 
functions. They arrive at similar functions because, after all, they use the 
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same data set. Additionally, under ordinary circumstances there are bound to 
be similarities between the estimates that CART gives and the estimates that 
the posterior mean gives (even if I might argue that the posterior mean esti- 
mates are preferable). Consequently, if (ignoring decision theoretic discipline), 
I decided to bag the posterior mean estimate, it "follows" that the bagged pos- 
terior mean estimate should be close to the bagged CART estimate. Since I 
believe that the posterior mean estimate is fairly stable under subsampled data 
(stability under bootstrap resampling is perhaps more questionable, but both 
questions suggest interesting future research), I conclude that the posterior 
mean estimate should be close to the bagged CART estimate. If the reader is 
skeptical that the posterior mean has stability under subsampling, they may 
be interested in the examples given in lsection 4.8t these substantiate this claim 
(but do not address it specifically). 

The above argument is merely heuristic, of course, so it is reasonable to ask: 
what does happen if the posterior mean calculation is bagged? The answer 
is shown in [Figure 4.1.c| The gray curves show the results of computing the 
posterior mean on fifteen bootstrap-resampled data sets. The red curve is their 
average: the "bagged posterior mean." Looking at the gray curves, they have 
many wobbles and spikes, so it seems that the posterior mean is more sensitive 
to the repeated observations that occur in a bootstrap sample than CART is. 
As a result, the red bagged posterior mean curve is itself rather wobbly. 

4.1.6 Situations in which the Estimates Differ 

It is possible to construct examples where the posterior mean estimate would 
differ more dramatically from the CART and bagged CART estimates. One 
need only consider circumstances in which CART will reliably perform in a 
rather special way. 

For a first example, recall the CART works by pruning a full tree and that 
this tree is selected in a greedy manner. The first split that CART chooses. 
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for example, is usually a very important split, but there are certainly cases 
where choosing it in a greedy way is suboptimal if one considers the global 
search for the "best tree." To some extent bagging improves CART's ability to 
find useful trees because sometimes a resampled data set suggests a different 
splitting order. Considerations like this are not even an issue for the posterior 
proper (because it is a theoretical construct), but they are an issue for the 
actual estimates that get produced by MCMC. Still the issues are different and 
generally, MCMC methods will perform a much more "global" search over tree 
space than CART does. This search ability was emphasized as an advantage of 
Bayesian CART by 0|. In summary, then, in a circumstance where the greedy 
search has problems, the posterior mean estimate may avoid those problems, 
and, consequently, give a rather different answer than bagged CART. In my 
experiments this sort of thing showed up (to a small extent) when I increased 
the sample size to 8192. This experiment is discussed in Isection 471 The 
CART estimates preferred a split in the middle of the right-hand slope that 
the posterior mean avoided. 

A second example occurs if the x-axis is transformed. One-dimensional 
CART estimates are invariant with respect to this, while the posterior mean 
is not. In some senses this property is desirable; it seems "scientific." It is not 
always desirable though: suppose there is a large amount of data from to 0.1 
and from 0.9 to 1 but no data in between. Roughly, CART will treat this in 
the same fashion as if there were no gap; essentially it only looks at the ordered 
values. If it splits the gap at all, it will split in exactly the middle of the gap: 
between the rightmost of the left-hand data and the leftmost of the right-hand 
data (ignore the fact that this is not strictly invariant under transformations). 
This effect does not go away under bagging (although it might get smoothed 
a bit as the endpoints of the gap change) . 

However, the posterior mean estimates will be quite different. Notably, it 
will make a smooth transition from the regression value on the left to the value 
on the right. Additionally, because this gap is especially large and the prior 
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Figure 4.1.d: Transformed Data with a Gap: For this example, the data from 
experimental run 1 was transformed nonlinearly to create a gap. Notice the 
smooth transition that the posterior mean makes along the gap. 



specifies that split points are put down uniformly, it is quite likely that the gap 
will be split at least once or twice. If the gap is split twice or more, then the 
middle intervals will include no data at all. When there is no data (or even little 
data), the prior kicks in to specify that it thinks that the success probability 
is uniformly distributed from to 1 and consequently that the mean value of 
the success probability on this middle interval is |. Because cases like this get 
averaged in, the posterior mean should show some shrinkage to | on the gap. 
Shrinking to | is, of course, not always ideal, but in such a case one ought 
simple to modify the prior and/or loss function. To construct an example with 
a gap, I took the data from experimental run 1 and transformed it so that the 
left half of the data now lies in [0,0.1] and the right half lies in [0.9, 1]. The 
result of computing the posterior mean is shown in Figure 

These features make obvious intuitive sense. Furthermore, if a 95% subjec- 
tive confidence interval were formed by asking at each point x G [0, 1] for the 
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smallest interval containing 95% of the posterior mass, it would grow wider 
in the middle. In contrast, if naive confidence bands were formed around 
the CART or bagged CART estimates (by using bootstrapping perhaps) they 
would make rather little sense. 

4.1.7 Posterior Mean Behavior 

In the first experimental run, CART left out an "obvious" split, while in the 
second it put it in. Sometimes CART will also "add in" splits that it should 
not have; this is very sensitive to the particular data set and the pruning rules 
that are used. Bagged CART averages all of these together (in some special 
way that is hard to formulate, except algorithmically) to arrive at its smoother 
curve. 

In contrast, I imagine that the posterior mean is considering each of these 
possibilities and giving them an appropriate weight before averaging, in order 
to give its best estimate. Examining the posterior mean curve closely, one can 
see that wherever CART takes a step, the posterior mean also moves more 
abruptly than normal. Both estimates are, after all, both looking for steps, 
and the locations that CART chooses are bound to be special parts of the data 
set, often containing a run of heads on one side and a run of tails on the other. 
Surely this feature would stand out to both methods. 

The reverse is not true, however. Consider the "bump" in the posterior 
mean, just to the right of |. This results because the posterior considered 
functions with additional splits in this area and gave them weight. It is not, 
after all, the result of averaging a large quantity of individually pruned trees, 
but the result of averaging over all trees (in principle) with appropriate weights. 
CART trees, if allowed to have extra splits would have included this one as well. 
Along this line, a modest improvement to the bagged CART procedure might 
result if in addition to using different bootstrapped datasets, one sometimes 
used different pruning criteria as well, and then averaged the less penalized 
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trees in together with the more penahzed ones (with appropriate weights). 

As an aside, the posterior sometimes becomes more "sure" about the loca- 
tion of a spht than it "really ought to." This happens because of the mismatch 
between the truth and the prior (or perhaps more accurately, because of the 
mismatch is between tt and my own subjective prior). The posterior is doing 
the absolutely optimal thing if the prior tt is true (indeed, it is an admissible es- 
timator and there is no easy way to tell if the others are or not), but according 
the prior, functions like the one pictured in blue with a smooth transition are 
impossibly rare. When faced with data that could have resulted from a smooth 
transition, but might also credibly be created by a step function with two steps, 
the posterior only considers the latter possibility. This is especially apparent 
when there is a run of heads and then a run of tails occurs by chance (as it is 
bound to do from time to time). The posterior will concentrate more tightly 
around this cut-point than makes sense if one considers a smooth transition 
to be a credible alternative explanation. The posterior does, of course, allow 
for the possibility that there are multiple splits, but if the success probabilities 
on each side are reasonably similar, there may not be enough data to make 
this possibility stand out and the single split will remain the most prominent 
feature of the posterior mean. This results in a stair-step appearance that 
does not go away with larger sample sizes fsee lsection 4.8|) . although the steps 
tend to get smaller. Indeed, it appears that the stair-step shape grows more 
prominent for larger n. Perhaps this is because a larger data set also is more 
likely to have at least a few very long runs. 



4.1.8 Experimental Runs 3-10 and a Summary 

The results from experimental runs 3 through 10 are shown in Figure 4.1.e[ By 
and large, the same observations made before above apply to these examples. 
Bagged CART and the posterior mean track each other quite closely although 
each occasionally takes a "wobble" that the other does not. Broadly speaking 
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both estimators still retain visible traces of the CART-type functions that they 
are averaging together. In particular, both usually have some "stepiness" in 
their appearance; the averaging "softens" this, but does not eliminate it. This 
is especially visible in experimental run 10 in which both estimates also follow 
the CART estimate quite closely. CART leaves out the left hand split on 5 of 
the 10 experiments. 





CART 


Posterior 


Bagged 






Mean 


CART 


1 


0.0891 


0.0522 


0.0527 


2 


0.0776 


0.0478 


0.0568 


3 


0.0893 


0.0587 


0.0710 


4 


0.0996 


0.0660 


0.0737 


5 


0.1101 


0.0659 


0.0683 


6 


0.1130 


0.0609 


0.0799 


7 


0.0779 


0.0604 


0.0593 


8 


0.0907 


0.0669 


0.0617 


9 


0.1009 


0.0720 


0.0718 


10 


0.0671 


0.0531 


0.0587 


5 


0.0915 


0.0604 


0.0654 


a 


0.0147 


0.0076 


0.0088 



Table 4.1: Numerical Summary of £^-norm errors on the ten experimental runs 

A numerical summary of these ten experiments can be made by computing 
the £^-norm of the error between the estimated curve and the truth. This sum- 
mary is given in table and illustrated by the scatter-plot in Figure 4.1.f| 
The black points compare CART to the posterior mean. The cyan points 
compare bagged CART to the posterior mean. In each case, the x-axis is the 
£^-norm error of the posterior mean, and the y-axis is that of the competitor. 
Obviously, small numbers are preferred, and because the black points lie ex- 
clusively above the identity line on these ten experiments, the posterior mean 
is preferable here. The performance of the bagged CART and the posterior 
mean estimates is much closer, although the posterior mean's performance is 
slightly better on average. 
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Run 4 Run 8 





Run 6 Run 10 

Figure 4.1.e: Simulation Experiment: Runs 3-10 

Key: True / (blue), Posterior Mean (magenta), CART (black). Bagged CART 
(cyan) 
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Figure 4.1.f: Comparative Scatterplot: The results of the ten experimental 
runs are summarized by £^-norm error (also known as V MSB). On the x- 
axis is the error committed by the posterior mean estimate. On the |/-axis is 
the error committed by the CART (black) or bagged CART (cyan) estimates 
respectively. 
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Figure 4. 2. a: Three Smoothers: 

Key: True / (blue), Posterior Mean (magenta), Loess (black). Smoothing 
Spline (green), Kernel Smoother (red) 



4.2 Comparison with Other Popular Methods 

This section compares the posterior mean estimate with a variety popular tech- 
niques on the data from experimental run 1. Throughout this section, as usual, 
the true curve is plotted in blue and the posterior mean with GeometriciJ^) prior 
is plotted in magenta. 

4.2.1 Smoothers 

Figure I4.2.al shows the results from running three standard smoothers. The 
results are little surprise. The three smoother's estimates are quite similar 
over all on this example. They over-smooth the jumps, but partially make up 
for this by giving a smoother approximation on the smooth half. They also 
all take a turn at the ends; this is consistent with the data which happens to 
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Figure 4.2.b: A Lasso Estimate: 

Key: True / (blue), Posterior Mean (magenta), Lasso (Cp) (green) 

behave somewhat unusually there. 

Loess (plotted in black) fits a locally- weighted linear regression at each point 
to make its estimate. Smoothing splines (in green) use efficient computational 
tricks to compute the regression curve that optimizes a tradeoff between small 
MSE and small integrated second derivative. Gaussian kernel smoothers (plot- 
ted in red) take a weighted average of response values near a point to predict. 
For each method, I chose a smoothing parameter that seemed to give results 
that were about as good as possible. 



4.2.2 LARS/Lasso/Boosting 

The green curve in figure I4.2.bl shows the result of using a Lasso penalized 
regression. This was particularly easy to do using using the Least Angle Re- 
gression (LARS) software js^l- Like any hnear regression, the results depend 
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on what basis is chosen. For this example, the basis is constructed by con- 
sidering the function 1^; > ^ ioi 1200 values of c evenly spaced from to 1. 
To encode a datapoint with covariate x G [0, 1] under this basis, evaluate the 
1200 functions at x and pack the results into a vector: this vector becomes 
the covariate that the regression uses. Finally, Lasso regression resembles or- 
dinary regression except for one critical difference: the regression parameter f3 
is penalized by A||/5||i, where A is a tuning parameter. For this example, the 
parameter A was chosen using a Cp criterion. 

Constructed in this way, the Lasso regression estimate should be quite sim- 
ilar to the estimates that would be arrived at using other important techniques 
such as "Boosting stumps" and the least angle regression method. The simi- 
larity between these different methods is discussed in js^. 

As can be seen from the figure, the Lasso estimate is has some appealing 
features. It is piecewise constant, but it also takes a fairly large number of steps 
and spaces them out usefully along the smooth transition on the right side of 
the figure. It does a reasonable jump of "detecting" the two change-points. On 
the other hand, it does not go as low as it should from | to |, nor as high as 
it should to the right of |. 

4.2.3 Wavelets 

Figure I4.2.cl compares some estimates that were based on wavelet techniques. 
The red, black, and dotted curves show various wavelet reconstructions of the 
regression curve. Overall, I think the results are quite disappointing; artifacts 
from the particular basis used show through clearly into the estimate. Since 
the data are not regularly spaced, some accommodation is necessary to use 
conventional software (e.g. Wavelab). Algorithms exist that apply directly to 
irregularly spaced data, but I did not successfully locate any working imple- 
mentations. Instead, the dotted curve shows the wavelet reconstruction that 
results from simply using the ordered covariate values as if they were regularly 
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Figure 4.2.c: Wavelet Estimates: 

Key: True / (blue), Posterior Mean (magenta), Wavelet 1 (green). Wavelet 2 
(red). Wavelet 3 (black) 
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spaced and then extrapolating back to the irregularly spaced reality. 

There are a number of problems with this approach and some recent work 
has developed more sophisticated schemes. For the red and black curves, 1 tried 



one of the simplest 1J| which recommends using direct linear interpolation to 
produce values on a fine grid and then applying wavelet shrinkage methods 
to this gridded data. The estimates were computed by the wden function in 
Matlab. It has a large number of options, most of which (quite frankly) I 
do not understand. These include the choice of a threshold selection rule from 
among four options, the choice to use hard or soft thresholding, an option titled 
"multiplicative threshold scaling" with three options, the choice of the level at 
which to compute the coefficients, and finally the name of the wavelet family 
to use (many options). For someone with as little experience with wavelet 
methods as me, these options are not a feature but a drawback. Furthermore, 
experimenting with the different options, they all seemed to make a difference. 
A reasonable, but not heroic effort was made to choose working parameter 
values. In any case, for the illustrated curves, the Matlab commands that were 
used are: 

Xi=seq(min(X) ,max(X) ,2"12) ; % a fine grid of X-values 
Yi=interpl(X,Y,Xi) ; % on which to interpolate the response 

Yhatl = wden(Y, 'heursure ' , ' s ' , 'mln' , 8,'sym8'); % 1: green 
Yhat2 = wden(Yi, 'heursure' , 's' , 'one' ,10, Mb4' ); % 2: red 
YhatS = wden(Yi , 'heursure ' , ' s ' , ' one ' , 10 , 'haar ' ) ; % 3: black 



4.3 Comparison with Dyadic Prior 



For comparison. Figure 4. 3. a shows the result of computing the posterior mean 
resulting from the Diaconis and Freedman dyadic binary regression prior [2^. 
Like the prior vr studied in this thesis, this prior chooses a random partition and 
assigns independent success probabilities to each partition element. However, 
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Figure 4. 3. a: Dyadic Posterior: The posterior mean of the Diaconis and Freed- 
man dyadic prior on the data from experimental run 1 is drawn in green 



this prior uses a dyadic partition, sphtting the data into 2^^ equal pieces for 
some k > 0. The consistency of the resulting estimates is guaranteed for any 
choice of hierarchy prior, except perhaps when the true regression function is 
identically ^ and the data is pure noise. For this case, certain priors will be 
consistent and others will be inconsistent. It was of interest, then to try one 
on the boundary. For this reason the prior on hierarchy level K that was used 
assigns: P{K = k) = {1 - (3)(3^ for (3 = exp(-2-3) ^ 0.431. In fact, for this 
data, the results were stable over a wide range of choices of f]. 

Since the prior is dyadic, it has no trouble at all nailing the split at |; 
of course, it does not have such good luck for the split at |. The posterior 
strongly favors a model with around 8 steps. 
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4.4 Dependence on the Parameter of the Ge- 
ometric Prior 

In this section, consider a departure from the Geometric{\) hierarchy prior. 
Figure I4.4.al shows what the posterior mean estimate would be for the data 
from experimental run 1, if a Geometrical — a) prior is used on the number 
of steps K. As usual, the true response curve is indicated in blue, and the 
posterior mean estimate for a = | is drawn in magenta; the posterior means 
for other values of a are also drawn. As can be seen in the figure, as a ranges 
from small values (short tail prior, dotted black curve) to large values (long tail 
prior, solid black curve) there is not so much difference in the posterior mean 
estimate except that certain small bumps and wiggles that are suppressed 
for the small a values become visible for the larger values. Indirectly, this 
experiment also provides a check on the stability of the Monte Carlo estimates 
of the posterior mean; it is unlikely that there would be such close agreement 
among these independently computed estimates if the MCMC was not working 
reasonably well. To make a more detailed comparison, it would be sensible to 
pool the sampled regression curves and use an importance sampling technique 
to compute combined results. I do not pursue this here. 

It is also of interest to know how many splits the posterior is using. Fig- 
ure |43IE1 shows the posterior on the number of steps K for three Geometric{a) 
hierarchy priors: a = 0.1, 0.5, and 0.9 respectively. The difference between 
the priors has a more pronounced effect here. Under a Geometric{l — a) prior 
with a = 0.1, complex models are quite rare and consequently, the posterior 
on step size shown in the top panel has a fairly short tail. Notice, though, that 
even for this conservative model, the likelihood has been able to overwhelm 
the conservative prior enough to shift most of the posterior mass onto models 
with 5 splits. 

For the bottom panel, a = 0.99, which corresponds to a rather slowly 
decaying tail. Notice, though, that the tail of the posterior is not nearly this 




Figure 4. 4. a: The Posterior Mean for a variety of Geometric Priors: Consider the data of experiment 1 and 
compute the posterior under a Geometric{l — a) prior. oo 
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Figure 4.4.b: The Posterior on Model Size Consider the data of experiment 1 
and employ a Geometrical — a) prior. A histogram of the number of sampled 
regression functions that had k steps is shown for three values of a: 0.10 (top), 
0.50 (middle), 0.99 (bottom) 
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Figure 4. 5. a: A Marginal Likelihood Surface: A slice of the posterior for the 3 
change-point case {k = 4) is shown. One change-point is fixed at 0.491 (not 
shown) and the location of the other two vary along the x and y axes of plot 
respectively. 

long: models with a large number of steps k can fit the data well, but also have 
a large number of parameters: 2k — 1. This tends to down- weight them as a 
group. When the posterior is marginalized to yield the posterior distribution 
on the number of steps, an account is taken of "how many" of these more 
complicated models give a good fit as well as how good the fit itself is. Because 
of this trade of, models with 9 steps are the most common. 

The middle panel, corresponds to the Geometric{^) prior that has been the 
subject of so many experiments. Notice that models with fewer than 5 steps 
have almost no effect on the posterior mean; most of the mass is on models 
with 5 to 9 steps: 6 is the most common choice. 
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4.5 The Predictive Probability Surface 

To better understand the interaction of the data and the posterior on exper- 
imental run 1, reference to [Figure 4. 5 .a) is useful. It demonstrates just how 
spiky, multi-modal, and (in particular) non-normal the posterior's density can 
be. It shows a slice of the posterior for the 3 change-point case (i.e. four steps). 
One change-point is fixed at 0.491 (not shown) because this was the most likely 
location for a single split; this location accounts for the jump at ^ in /q. The 
other two change-points are allowed to range over the x and y axes of the plot. 
More technically, what is plotted is a self-normalized version of the function 
0(x), defined by [Equation 3.23| where x = (4, (0.491, x-coord, y-coord)). Essen- 
tially (f){x) computes the likelihood that the splits occur at a given location; 
it marginalizes out the different possible choices for the success probabilities. 
The height of the surface follows (f){x), and the color follows log(0(x)), so that 
the small-probability structure is also visible. The x and y axes are symmet- 
ric, of course, because splitting at a and b is the same as splitting at b and a. 
Similarly the function is largest along horizontal and vertical "bands." This is 
because when one split is in a particularly fortuitous place, it tends to improve 
the fit over all, even if the placement of the second split is suboptimal. The 
highest two peaks (near the opposite corners), represent splitting on the left 
(in the vicinity of |) to take care of the jump that is there, and on the right (in 
the vicinity of 0.75) to split the smooth transition region into its higher and 
lower halves. The secondary peaks near the far corner, represent splitting the 
smooth transition in two places and ignoring the left half (recall that this was 
the choice made by CART on this data). 



4.6 Behavior on a Small Data Set 



It is interesting to see how the posterior responds to individual data points; 
this is most easily seen in a very small dataset. In Figure 4. 6. a I consider a 
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Figure 4. 6. a: Small Data Set Experiment: On this five element data set, the 
posterior mean curve in magenta is the weighted (c.f. inset histogram) com- 
position of the colored curves. These curves represent the mean contribution 
from models with a fixed number of steps which ranges from 1 to 15 
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data set with 5 data points, that, as usual are shown by the two histograms. 
In this case there are two heads on the left and three tails on the right. The 
posterior mean (magenta) is seen to respond in a smooth way, except at the 
data points where it remains continuous but takes a (small) sharp turn. The 
other colored curves represent the posterior mean when the number of steps 
that the function is allowed to take is fixed a-priori. The flat black line, for 
example, is the posterior mean when no splits are allowed (one step). The 
red curve, on the other hand is the posterior mean if 14 splits (15 steps) are 
required. Notice how the red curve, especially, drifts back towards | for x- 
values that are not close to the data points. This happens because with 14 
splits and 5 data points, there are bound to be many empty intervals and 
those, necessarily, fall back on the prior. The posterior mean curve for a 
Geometric{^) prior on the number of steps is the weighted average of these 
curves, where the weights (as percentages) have been tabulated by the inset 
histogram. Since the 1-split model fits this data so especially well, it winds up 
contributing about as much as the contant model (which cannot be ruled out 
with so little data) to the final result. The contribution of the higher models is 
not forgotten, though; notice how the posterior mean (magenta) drifts back to 
I slightly on the left of 0.2 and the right of 0.8. Overall the posterior mean is 
conservative; it does not, for example, split the data in half at 0.5 and declare 
the left hand mean to be 1 and the right hand mean to be 0. 



4.7 Behavior on a Large Data Set 

The data from experimental run 1, consisted of 1024 (x, y) pairs with the x- 
values drawn uniformly from [0,1] and the y-values drawn BernouUi{fo{x)). 
This section answers the question: how does the posterior change if this data 
set is enlarged to have 8192 datapoints by generating additional data from this 
model? As usual the true curve is drawn in blue and the posterior mean for the 
Geometricij^) prior is drawn in magenta. The data set, as is visible from the 
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Figure 4. 7. a: Large Data Set Experiment: The data from experimental run 1 
is extended to a large data set of size 8192 

Key: True / (blue), Posterior Mean {a = |) (magenta), Posterior Mean (a = 
0.995) (green), CART (black), min-xerr CART (dashed black) 
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histograms, is getting rather large. Pleasantly, the posterior mean estimate is 
also giving an accurate estimate of the true curve. Somewhat disappointingly, 
though, the step functions on which the prior is based have not gone away. 
Although they are smaller, they are clearly visible in the magenta curve. 

The CART estimate for this data is shown in black. A couple of observations 
need to be made. First of all, because the 1-standard-deviation pruning rule 
was used, the CART curve only has 6 splits. Looking through the full tree, 
the model that minimizes the cross-validated error has two additional splits 
and is drawn by the dotted black line. This model agrees quite closely with 
the posterior mean estimate, but two of its splits were pruned away when the 
1-standard-deviation pruning rule was used because the standard deviation of 
the xerr is not small enough. It might be selected automatically if a more 
intensive cross-validation were used. Finally, note that both CART curves 
minor artifacts (when compared to the posterior mean) that result from the 
greedy nature of the full tree. 

One might suppose that with this much data, it ought to be possible to fit a 
model with many more steps. This does not seem to be true (at least for a prior 
that models the success probabilities independently). For example, if the full 
CART tree is manually pruned to have only one or two additional splits beyond 
the 8 that were used by the minimum xerr model, the additional splits visibly 
degrade the fit. To understand this, consider that the right half of the data 
should contain around 4096 points. By chance, they will not be (quite) evenly 
distributed over this half but for simplicity suppose that these "4096" points 
are divided evenly into the 6 intervals selected by the larger CART model. This 
leaves around 680 points in each partition cell. Recall that the variance of a 
Binomial{n, p) random variable is np{l —p), so that the standard deviation of 
the estimated success probability on each of these partition cells is going to be 
around 0.02. Considering that the regression estimator has to not only detect 
a difference, but also locate a good choice of split, and optimize the accuracy of 
the estimated success probability, it does not seem too unreasonable that the 
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average jump in success probability between neighboring cells is around 0.1. 

Also shown (by the green curve) is the result of computing the posterior 
mean when a Geometric{l — a) prior is used for a = 0.995. Interestingly, 
this long tail makes little difference and the green curve barely peaks out from 
under the magenta one. 



4.8 The Effect of Sample Size 

The previous section developed a extended version of experimental run 1 that 
contained 8192 data points. In figure I^^FIal this data is analyzed in more detail. 
Smaller data sets are formed by taking the first n data points for n ranging 
from 64 to 8192 by powers of two. It is very pleasant to see how the posterior 
mean incrementally grows closer to the truth. At first, the steps on the left are 
almost ignored (with so little data, any pattern they contain could have resulted 
from noise), but gradually they fill in. The transitions near the change-points 
in /o become very sharp and the estimates of the smooth transition steadily 
improve. 



4.9 The Effect of Sample Size: a Harder Ex- 
ample 

For the final example, I consider a much harder regression function. It is 
depicted in blue in [Figure 4. 9. a It was formed by taking multiple copies of 



the original /o and shrinking them to half their size repeatedly. A data set 
with 8192 point is simulated and the posterior mean is calculated for subsets 
of increasing size. As had been hoped, the features get filled in as the data size 
increases incrementally. The larger features rise above the noise first and then 
the smaller, so that for this regression function, the approximation seems to 
grow better as n increases on the right first, but then steadily spreads to the 
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Figure 4. 9. a: The Effect of Sample Size: a Harder Example The posterior 
mean is computed as sample size n runs through a wide range on this more 
challenging problem 
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left. Importantly, for the larger n, the posterior mean concentrates on models 
with many more splits than it did for the easier data since. Presumably this is 
because more complex models are necessary to increase the likelihood. To some 
extent this makes sense because even though this example is more complex than 
the previous one, some aspects of the regression function are relatively easy to 
detect (e.g. the large jump) and there are more of these features available for 
analysis. It is interesting to compare the results with the former dataset for 
n = 4096 with the results on the right half of the current dataset for n = 8192. 
They should be quite comparable because the regression function for this more 
complicated model on [|, 1] is just a rescaled version of the original model, and 
in both cases there should be about 4096 datapoints available. To appearances, 
the two results are quite similar except that the latter result does not do as well 
on the small interval from 0.5 to around 0.6. Perhaps it is being confused by 
the low success probability region immediately to the left of 0.5. Conducting 
a similar comparison between the original n = 1024 example the right half 
of the n = 2048 example, the latter result seems substantially inferior. On 
the other hand, it is not so bad when compared with the original result on 
experimental run 5. Furthermore, the performance of the posterior mean is 
quite good considering the overall increase in the difficulty of this problem. 
Even more amazing, considering the popular state-of-the-art methods, it does 
all of this without any tuning or cross-validation. 



Chapter 5 
Consistency 



This chapter estabhshes conditions under which the prior for one-dimensional 
classification that was introduced in Isection L2l is a consistent estimator of 
the true regression function. The consistency of the posterior is proven using 
a result by Barron, Schervish, and Wasserman ^. Before reviewing their 
theorem, I pause to introduce some notation. I also present a lemma that 
shows that the original conditions given in their theorem are equivalent to 
some others that may be easier to check. Finally, I specify the prior vr more 
formally and complete a proof of consistency. 



5.1 Notation and the Basic Theorem 

Let T be the class of all (Borel) measurable functions / : [0, 1] {0, 1}. Write 
/i for the uniform distribution on [0, 1] and i] for counting measure on the set 
{0, 1}. Write Z for the product space [0, 1] x {0, 1}, and call the product 
measure v. To any / G there is a corresponding density on Z with respect 
to V that we denote by /: 

/(x, y) = f{x)ly = 1 + (1 - f{x))ly = (5-1) 
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For notational convenience, we may write either f{x,y) for x G [0,1] and 
y G {0,1} or f{z) for z = {x,y) G Z. Write F = Ff for the distribution 
on Z whose density with respect to u is /. In words, the consequence of this 
construction is that samphng a point Z from F is the same as choosing an 
X uniformly on [0, 1] and then (conditionally on X = x) determining Y by 
flipping an "/(x)" coin. 

Write JF for the class of densities formed by considering / for every / G JF. 

Let d denote the Hellinger distance on JF: 



And let D denote the KuUback-Leibler discrepancy on JF (employing the 
usual convention that the integrand is interpreted as whenever f{z) = 0): 



We are concerned with posterior consistency and so the mass that the prior 
or posterior ascribes to certain small sets containing the true parameter /o is 
of interest. For any e > 0, we define two such "neighborhoods:" 



The "richness" of the parameter space is also an important quantity; to 
make this precise we supply the following definitions. 

Definition 1. Consider a class of functions A that are densities with respect 
to dominating measure u. We say that the collection of functions {f^ , . . . , f^} 
is a (5-upper bracketing of A if: 

1. for every a & A there exists i such that a < a.e. [v] 




(5.2) 




(5.3) 



i^, = {/G^|Z^(/o,/)<e} 
H, = {Je^\d0,J)<e} 



(5.4) 
(5.5) 
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2. every satisfies / {z)i'{dz) < 1 + 5 

Furthermore, write 7i(^, 5) for tlie (5-upper metric entropy of A wliich is tlie 
logaritlim of the size of the smallest possible (5-upper bracketing of A (infinity, 
if no finite bracketings exist). 

The following result can now be stated. It gives conditions on the prior 
that are sufficient to ensure that the posterior will concentrate on for any 



Theorem 4 (Barron, Schervish, and Wasserman 1999). Let u be a a- 

finite measure on a measurable space {Z,B), where the a -field B is separable. 
Let IT be a probability distribution on T , a class of probability densities with 
respect to v. Endow T with the Borel a-field induced by the Hellinger metric 
d. Let /o be a certain chosen density with respect to v and write Fq for the 
corresponding distribution on Z. Let Zi, Z2, . . . be drawn iid from Fq. In the 
notations explained above, further assume that for every e > 0: 

1. n{K,) > 

2. There exists a sequence {An}'^=i of measurable subsets of J-" and positive, 
real numbers di,d2, c, and 6 such that: 

(a) tt{A'^) < diexp{—d2n) for all n sufficiently large 

(b) H{An, S) < an for all n sufficiently large 



Then, with probability 1 [under F^ measure], Bayes theorem applies for all n; 
i.e. for any measurable B <Z T and any n: 



e > 0. 




(d) 6 < eV4 



n{B\zi, ...,Zn) 



UK=Ji^Mdf) 
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And for any e > 0; 



lim 7r{H^\zi, . . . , Zn) = I 



n- 



■oo 



The second condition of this theorem seems rather technical. To restate 
this theorem in simpler terms we prove an elementary lemma which shows 
that these conditions (item / in the lemma) can be expressed in three other 
equivalent forms. 

Lemma 5. The following conditions, given in the notation defined above, are 
equivalent: 

I For all e > 0, there exists {An\'^=i, a sequence of measurable subsets of 
T and positive, real numbers di,d2, c, and 6 such that: 



II There exists a sequence of positive real numbers {6^}'^i, with 5* J, 0, and 
{{-^nln^ili^i? ^ sequence of sequences of measurable subsets of such 
that: 

a for all i, limsup„ log(7r {{A^^)^)) /n < 
b limsupj limsup„ 7Y(^^, = 

/// For all e > there exists {An}'^=i, a sequence of measurable subsets of 
T and a positive, real number 5 < e such that: 

a limsup„ log(7r ((^n)"^)) /n < 
h limsup„ 7Y(v4„, 5)/n < e 



a rc^A'n) < di exp(— for all n sufficiently large 



b H{An, S) < an for all n sufficiently large 




d 6 < eV4 
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IV For all e > there exists {An}'^=i, a sequence of measurable subsets of 
T such that: 

a limsup„ log(7r {{AnY)) /n <0 
b limsup„ 7Y(^„, e)/n < e 

Proof I =^ II: 

First, notice that the condition from I. a that "there exist di > and 
(^2 > so that 7r(A^) < c/i exp(— ^2^^) for n sufficiently large" implies that 
limsup„log(7r(^J^))/n < — c/2 < 0. Conversely, if limsup„ log(7r(^^))/n = a < 
0, then for all sufficiently large n, log(7r(^^))/?T, < a/2 and vr(^^J < exp(|n) 
for all sufficiently large n. 

Choose a sequence of e* > 0, e* | and use / to establish the existence 
of {A^^} ., d\, d\, c\ 5^ satisfying / for e\ Since 5* < (e*)^/4 [ 0, we can find a 
subsequence on which 5* | 0. Without loss of generality, assume we already 
have such a subsequence. The above reasoning, then, establishes II. a for this 
sequence. Since La implies that limsup„7Y(^^, < c* for all z, to estab- 
lish Il.b, we need only show that limsupjC* = 0. Condition I.c constrains 
c* < ^(e* — v^)^ — /2. Viewing this as a function of 5*, notice that it is 
monotonically decreasing on [0, (e*)^/4] so that, necessarily, d < (e*)^/2, the 
value of this function at 5* = 0. 

// =^ I: 

Consider some e > 0. Note that ({e - \f¥f - S'^ /2 t as ? ^ 00. 
Find an i sufficiently large so that this expression is at least Then find 

a subsequent i sufficiently large so that limsup„7i(^^, (5*)/?7, < Choose 
c = and 6 = 6\ and / is proven. 

// <^=^ ///: This is a straightforward exercise in nitpicking. 

/// <^=^ IV: Observe from the definition that I-l{An, S) is a non-increasing 
function of 6. 

□ 
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Theorem 6 (Corollary to Barron, Schervish, and Wasserman 1999). 

Let u be a a -finite measure on a measurable space {Z,B), where the a -field 
B is separable. Let n be a probability distribution on T , a class of probability 
densities with respect to v. Endow T with the Borel a -field induced by the 
Hellinger metric d. Let Jq be a certain chosen density with respect to v and 
write Fq for the corresponding distribution on Z. Let Zi, Z2, . . . be drawn iid 
from Fq. In the notations explained above, further assume that for every e > 0; 

1 7i{K,) > 

There exists a sequence {An}'^=i of measurable subsets of T , so that: 

2 limsup^ log(7r {{An)")) /n<Q 

3 limsup^ H{An,e)/n < e 

Then, with probability 1 [under F^ measure], Bayes theorem applies for all n; 
i.e. for any measurable B d and any n: 



!b nr=i f{z^>{df) 



/^nr=i/(^o^(rf/) 

And for any e > 0; 

lim 'k{H^\zi, . . .,Zn) = 1 



In words, this is what we have required: 1) the prior must put positive mass 
on all Kullback-Leibler neighborhoods of /q. 2) We must be able to choose an 
increasing sequence of subsets of the parameter space so that the n'th of these 
sets captures all but exponentially much of the prior mass 3) This sequence 
must not grow in "complexity" too quickly. We conclude that the posterior 
will concentrate on the subset of the parameter space which is Hellinger close 
to the true parameter. 
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5.2 Specification of the Prior 

We describe vr as a distribution on JF by means of first describing 
metric prior vr'. Let k be a distribution on tlie positive integers. Let = 
{(fc,v,s) : V e [0, 1]'=-Ss e [0,1]''} and let 6 = U^^Gfc. Let vr' be the 
distribution on 6, that can be described by first picking K according to k; 
and then, conditional on K = k, picking a point 6* G O uniformly from O^. 
That is, V and s are chosen independently and uniformly from the appropriate 
unit cubes. Let V(j) denote the i'th ordered value of v. Now to any 6* G 
associate the function G given by the following construction. For k > 1, 

let Ji = [0,V(i)),J2 = [V(2), V(3)), . . . = [V(^._2),V(fc_i)),4 = [V(fc„i),l]. If 

k = 1, just let Ji = [0, 1]. Take fe{x) = J2^=i ^i'^x G /■• Filially, then, to draw 
/ from vr, draw 6 from vr', construct fg, and form fg as in [Equation 5.1 

5.3 A Consistency Proof 

This section proves that the prior tt just described is consistent in the sense 
that, under repeated sampling, the posterior mass will concentrate on a Hellinger 
neighborhood for any e > 0. The proof uses three lemmas to establish the 
conditions of Theorem El 

The first lemma shows that vr puts mass on all KuUback-Leibler neighbor- 
hoods K^. 

Lemma 7. Let vr be the prior distribution on T , the class of densities w.r.t v, 
that was described above. Assume that k, the hierarchy prior, assigns positive 
mass to every natural number. Let /o be an arbitrary density in T . Then, for 
any e> 0, 7r(A"J > 0. 

Proof. Let /o be the corresponding function in JF. For some < 5i < ^, to be 
determined later, let: 

g = {geJ^: ||(?-/o||i <25i and5i < < l-5i (Vx)} 
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Let Ag = {x e [0,1] : loix) - /o(x)| < 62} where 62 := VW- By 
Chebyshev's inequality, for any g E the Lebesgue measure of Ag'^ is smaller 
than 82- 

In other words, as shown in Figure 5. 3. at if x G Ag, this restricts the pair 
(/o(x), to lie in the convex set C (blue) whose extreme points are (0, 5i), 
((5i + (52, (1,1 — (1 — (5i — ^2, 1 — <^i)- For X ^ we still have a restriction 
on g{x), so that the pair (/o(x), (7(0;)) lies in the rectangle R (green) with 
vertices (0, 5i), (1, 5i), (1, 1 — 5i), (0, 1 — 5i). Let Di{p,p') denote the KuUback- 
Leibler discrepancy between the Bernoulli{p) and Bernoulli{p') distributions. 

p 1 — p 

Di{p,p') ■.= p\og— + (1 -p) log- 

p' I — p 
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Then for any g & Q we can bound D{fo,g), as follows: 
D{fo,g) = I D,{foix),g{x))+ j D,{f,{x), g{x)) (5.6) 
< / sup Di{a,b) + / sup Di{a,h) (5.7) 

JAg ia,b)eC JAg" {a,b)&R 

and because -Di(a, b) is convex in the pair (a, 6) the supremum is achieved 
at the vertices so that the above is bounded by: 



< max Di{a,b)+62 max Di{a,b) (5.8) 

{a,6)G{verticcs of C} (a,ti)s{vcrtices of R} 

Using the symmetry Di{a, b) = Di(l — a, 1 — 6) 



< max (Z}i(0, 5i), Di{6i + 62, 5i)) + 62 max (Di(0, 5i), Z}i(0, 1 - 61)) 

(5.9) 

= max|^-(l - 5i) log(l - 5i), 

(^f + 52) log(^) + (1 - 5f - 5.) log(^^)) 
+ 52(-log(5i)) 

(5.10) 

< max (- log(l - 5i), 2 log(2)52) + 52(- log(5i)) (5.11) 
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For small 6i, the last term is the most important. To simplify the first term, 
verify that for < 5 < |, - log(l - 6) < 6 + < ^6 

< max 05i,21og(2)52^ + S^i-logiS,)) (5.12) 

<^52 + 52(-log(5i)) (5.13) 
= V^(-log(5i) + ^) (5.14) 

Which tends to zero as 6i | 0. For a sufficiently small choice of 6i, then: 

g = {g : gEg}cK, = {fE^ : D{J,J) < e}. 

It remains to show that Q has positive prior mass. To do this, we first find 
a step function go ^ G which approximates /o and then show that Q*, a set of 
perturbations of go, remains in Q and that Q* has positive prior mass. 

Observe that since fo is Lebesgue-measurable there exist two increasing 
sequences of step functions hf and for which ||/o — {hf — h^)\\i ^ 0. This 
is, in fact, the basis of a common construction of the Lebesgue integral 
Consequently, we can find a step function h E for which ||/o — /^-lli < ^i/2. 
Let go be the function obtained by modifying h so that it always remains 
in [5i,l — i.e. go{,x) = max(min(/i(a;), 1 — 6i),6i), so that go ^ G and 
||/o-^o||i < 

Now, parameterize go in the manner of lsection 5.21 fchanging it at a set of 
measure if necessary) so that A; < cxo is the number of locally constant regions 
in go and the vectors v = {vi)^^l and s = (sj)f^]^ signify the change-points and 
success probabilities of go- 

If we then perturb each Si towards the value | by no more than (5i/4 we 
obtain a new function g' which remains in Q and satisfies \\go — g'\\i < 5i/4. 
If, in addition, we perturb the Wj's by no more than 6i/{A{k — 1)) we obtain g" 
which also remains in Q, satisfying \ \go — g"\\i < Si/2 or ||/o — g"\\i < 26i. 



42|. 
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Denote the class of functions thus obtained by Q*. Then n{Q) > 7t{Q*) > 
K{k}{Si/4f{6i/{2{k - 1)))^-^ > 0. □ 

Remark 1. A shorter proof can be based on the martingale sequence formed 
from conditional expectations of /q. 

For the second lemma, consider JF^, the class of functions which are con- 
stant except possibly for as many as m — 1 change-points. 

Definition 2. Using the notation from lsection 5.2| let J-'m = {fe '■ d G Om}- 

Let J^rn be the associated class of densities with respect to u: 

Tm = {J{x,y) = f{x)ly = 1 + (1 - f{x))ly = : / e J-„.} 

Lemma 8. The 6-upper bracketing entropy of the class Tm is no more than 
(2m- l)log([m/(5]). 

Proof. Fix a, b positive integers. Partition [0, 1] into a equal intervals Ii^ ... .,1a- 
Consider the class Hm-a,b of all functions that can be formed in the following 
way: Choose some m — 1 of these intervals. Let C be the union of the chosen 
intervals and let Ci, . . . , be the nonempty subintervals of [0, 1] formed by 
subtracting C {k < m). Finally choose Bi, . . . G {1, . . . , 6}. Construct the 
function f : [0, 1] x {0, 1} ^^ [0, 1] by: 



f{x,y)={ 



1 iixeC 

Bi/b ii X E Ci and y = 1 

1 - (5, - l)/b if x G Ci and y = 



It is easy to see that, by appropriate choices, for any / G J-'m we can find 
an G Hm;a,b that is greater than or equal to it globally. Furthermore, the 
integral of is less than or equal to 1 + 1/6 + (m— l)/a. The size of Hm-a,h 
is < {rn-il^^ — By choosing a = b = \m6~^~\, we have shown that 

exp(H(^^,5))<([mril)2™-i. □ 
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The final lemma establishes a result about the tail of Poisson random 
variables. 

Lemma 9. If K ~ Poisson{X) for some A > 0, then for any k > X: 

P{K>k)<e' ' 



k\ \k-\ 

And, consequently, for any < /5 < 1 there is a ko sufficiently large so that for 
every k > ko, P{K > k) < exp{—f3klog{k)). 



Proof. 



P{K>k)=e-'J2- 



i=k 



k\ 



i=k 





'k\k'-^' 




i\ 



Now, the bracketed term is no more than 1 and we are left with a geometric 
series whose factor, \/k is less than 1 by our assumption so that: 



P{K >k)<e 









[; 






k\ 


Kk 



So that for k sufficiently large, from Stirling's formula: 

log(P(J'S: >k))<- \og{k\) -X + k log(A) + log 
< -(3klog{k) 



k-\ 



□ 



The main result can now be established. 
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Theorem 10. Suppose that the prior ir, described in \section 5 /J is based on 
the hierarchy prior n. Suppose that k gives positive probability to every natural 
number and that its tail satisfies: ^{{k G N : k > j}) < exp(— /5j log(j)) 
for all j sufficiently large and some j3 > 0. Let /o be an arbitrary measur- 
able function from [0, 1] into [0, 1]. Suppose that Zi, Z2, . . . are drawn iid from 
the distribution Fq, which has density /o with respect to u. Equivalently, sup- 
pose that the Zi's (Zi = {Xi^Yi)) are drawn as follows: the Xi's are drawn 
independently and uniformly from [0,1]; and, conditional on Xj = Xi, Yi is 
an independent Bernoulli{fQ{xi)) random variable. Then, in the notation of 
\section 5.11 for any e > 0, 7r(ife|2i, • • • , ^n) —^1 as n ^ 00 [F^^-a.s.]. Specif- 
ically, for any 1 < p < 00, and any e > 0, the posterior mass on the set 
{f E J-" : 11/ — /ollp < e} tends to 1 as n ^ 00 [F^^-a.s.]. 

Proof. Let m{n) := [an/ \og{n)\. Choose the sequence {An} as An = J^m{n)- 
Then, by Lemma |H1 7Y(^„,e) < {2m{n) — 1) log([m(n)/e]). Observe that 
m{n) I 00 as 71 ^ 00. For n large enough, then: 

H{An,e) < 2m{n) log(ara/ log(ra)e~^ + 1) 

< 2ar;,/log(?7,)[log(n/ log(n)) + log(a) — log(e) + 1] 
log(n/ log(n)) 



< 3an- 



log(n) 



Choosing a = e/3, we have proven that hmsup7-^(v4„, e)/n < e. 
Now calculate that for all sufficiently large n, 

-log(7r(A')) = -log(K({A; G N : k > m{n) + 1})) 

> f3[m{n) + 1] log(m(ri) + 1) 

> j3[an/ log(n)] log(an/ log(?T,)) 
log(n) - log(log(n)) + log(a) 



= a(3n- 

> -adn 
- 2 ^ 



log(n) 



CHAPTER 5. CONSISTENCY 



105 



Consequently, limsuplog(7r(^„'^))/'n, < —^a(3 < 0. 

Finally, recall that Lemma [7| shows that for any e > 0, 7t{K^) > 0. Apply 
Theorem ini to complete the proof. 

The statement about || ■ ||p follows from the equivalence of the and 
Hellinger metrics for bounded densities. This fact and other useful inequalities 
about Hellinger distance d (aka Jeffrey's distance) are reviewed in 22, section 
5.8 and excercise 5.7] which states: 

d{f,gf<\\f-g\\i<2d{f,g) (5.15) 

Finally, the equivalence of the £^-norm and the norm for 1 < p < oo and 
for all densities uniformly bounded by a constant, is well known. □ 

Remark 2. Applying Lemma IHl shows that the preceding theorem applies to 
any hierarchy prior k, whose tail behaves like that of a Poisson{X), for any 
A > 0. The theorem does not apply to the case in which k, is Geometric. 

Remark 3. The restrictions on the tail of the prior occur because condition 
3 in Theorem IHl requires that the "sieve" sets An do not grow in "size" too 
quickly as n grows. The choice A = Tm{n) made in the proof of this theorem 
is (essentially) the fastest rate of growth that this situation permits (in the 
absence of better bracketing estimates). Accordingly, condition 2 of TheoremlHl 
requires that the tail of the prior drops off somewhat faster than a Geometric 
distribution. 

Remark 4. For further discussion of how to interpret this result, please see the 



discussion in chapter 6 



Chapter 6 

Discussion of Consistency 
Results 



It is challenging to suggest what the practical consequences (if any) of Theo- 
rem I lUI are. It proves that under the modeling set up with iid observations 
that has been considered throughout this thesis, that the posterior of the ran- 
dom split prior vr is a consistent estimate of any measurable true regression 
function if the tail of the prior decays at least as fast as exp(— /9j log(j)) for 
some (3 > 0. Perhaps it would be wise not to over interpret the result. After 
all, it only supplies a sufficient condition for consistency and does not either 
establish that Geometric priors lead to inconsistent estimators or that Poisson- 
like priors are a good (i.e. practical) idea. Additionally, it attempts to prove 
consistency in a fairly strong sense: that the posterior mass concentrates on 
Hellinger neighborhoods of the truth. Weaker consistency results, say that the 
posterior mean be consistent, might go through under milder assumptions. 
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6.1 Consideration of the Diaconis and Freed- 
man Results 

Judging from the results of Diaconis and Freedman for their prior, as discussed 
in lsubsection 2.2m it might, in fact, be the case that the posterior is consistent 
for any hierarchy prior k (with full support), so long as /o is not the constant 
function |. That is, it might be that the only situation in which it is inconsistent 
(even using a "poor" choice of k) is when there is no real pattern at all in 
the data because every coin flip was fair. Note that if this were our only 
concern, namely that the estimates might be inconsistent under some specific 
finite collection of possible scenarios, this could be easily addressed (albeit in a 
decidedly non-Bayesian way), by choosing a prior that puts point mass on the 
troublesome cases. Consistency for these exceptions would then be guaranteed 
by a suitable application of Doob's result [29|. Interestingly, it could destroy 
consistency for other cases. An example of such a mixture is given by Diaconis 
and Freedman [s^j- Of course, a true subjective Bayesian would never change 
his or her prior in this way. Rather, unless these cases actually are a subjective 
impossibility, a purist would merely see this as an explication of why their true 
prior (that they are perhaps still in the process of articulating) differs from the 
former one. 



6.2 An Experiment to Check a Worrisome Case 

It seems reasonable to try and test for the possibility of inconsistency when the 
true function /q = | by running a simulation experiment. To do this, generate 
an increasing sequence of data sets over a range of sizes that are drawn from 
the /o = I distribution. That is, since there is no true signal at all in the data, 
it will be interesting to see how the posterior mean responds. For a Poisson 
prior, the posterior mean should (at least eventually) settle down, but will 
this hold for a Geometric prior? Indeed, the results in figures IB". 2. a( and I6.2.bl 



CHAPTER 6. DISCUSSION OF CONSISTENCY RESULTS 



108 



indicate that both estimates correctly identify the null case, whether using a 
Poisson{5) prior or a Geometric{^) one respectively. A new feature of these 
figures is the small histogram included on each one. It is a histogram of the 
posterior on the number of steps K in the unknown function. The Poisson{5) 
prior starts out believing that there will be a good number of steps in the 



data. This is clearly reflected in the histograms in Figure 6. 2. a for n = 64 and 
n = 128. The posterior mean in these cases has more "wiggle" than for the 
Geometrici^) prior even though both estimates see the same data. For n = 512 
both estimates find that the posterior mean is roughly constant, but somewhat 
lower than |. Apparently, this is a real feature of the data and not an artifact 
of either prior. Eventually, for large n all of these minor considerations wash 
out and clear preference for very flat models has triumphed. Interestingly, from 
n = 512 or so onwards, looking at the histogram of K, the Geometric prior 
has become convinced that the model has only one split. The Poisson prior 
is only beginning to reach this level of certainty about the truth as n reaches 
8192. 



6.3 Theory versus Practice 

Nevertheless, if the user insists upon using an estimation procedure that has 
been proven to be consistent (say, by the preceding theorem), he or she still 
have a great deal of freedom. Roughly speaking, they can use any hierarchy 
prior they like on the first 100!!! natural numbers and append to it an arbitrary 
Poisson tail. It is hard to imagine that there would be any practical differ- 
ence between the estimators resulting from this "extended" prior and those 
resulting from the unmodified one, at least for realistic sample sizes. There 
certainly would not be any difference in practice, because in practice we only 
approximately compute the posterior anyway and no ordinary Markov chain 
Monte Carlo would ever be run long enough to notice the change. 
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6.4 Heuristics about Poisson and Geometric 
Priors 

Despite these concerns, I still (tentatively) advocate the use of a Geometric 
prior because of the following heuristic arguments: (a) it favors simple models 
(b) its tail does not drop off "too quickly" so that it will hopefully not require 
enormous amounts of evidence for the data to "overwhelm" the prior: by ruling 
out simple models in favor of better fitting models (c) its tail does drop steadily; 
hopefully this will protect us from "over-fitting" (d) if we consider the mode 
of the posterior (on a log scale), the geometric prior penalizes each additional 
split by a constant (log(a)) so that for the mode to shift to a model with an 
additional split we require a commensurate improvement in the log-likelihood 
of the data. 

If using a Poisson{X) prior, I would be concerned that I might have spec- 
ified a parameter A that was too small. If the true regression function were 
unexpectedly complicated, it might take a large amount of data to overwhelm 
the prior. Interestingly, if we attempt to remedy this by taking an exponential 
mixture of Poisson^s, we get back a Geometric prior. 

Another objection of mine is that (for modestly large A) the Poisson prior 
puts less mass on models with one region than on models with two regions. 
This makes sense if I actually expect that the regression function will be fairly 
complicated, but it violates my (frequentist) training to consider a complex 
model before "eliminating" the simpler one. 

As a compromise I would propose a prior on K that was Geometric until 
a certain point ko and then decays like a Poisson. On the other hand, if, in 
fact. Geometric priors prove reliable or even conservative, it might make sense 
to consider an even heavier tailed distribution. A modest proposal is to take a 
uniform mixture of Geometric{p) priors for p ranging from to 1. This results 
in a prior whose tail decays like 1/k, the mass ai K = k being [k{k + 1)]~^. 
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Figure 6. 5. a: The Result of Using a Poisson Prior: Consider again the data 
from experimental run 1, but apply a Poisson{X) prior with A = 1, 5, or 10 
Key: True / (blue), Poisson{l) prior (cyan), Poisson{5) prior (black), 
Poisson{10) prior (green) 



6.5 Conclusions 



None of these arguments is conclusive. In the absence of sound theoretical 
arguments it is perhaps best to rely on experimental evidence. From chapter 4 
there is a good bit of evidence that Geometric priors perform well. What do 
Poisson priors do when applied to these data sets? In |Figure 6.5.a| the posterior 
mean resulting from a Poisson prior for three values of A is plotted. The results 
are comparable to what happened as a was varied in Figure 4.4.a[ The prior 
with the shorter tail flattens out some bumps that the prior with the longer 
tail leaves in. 

Still, for some applications especially, the practical question remains: how 
to choose a (respectively A)? Experience in the problem domain is the only 
method I can readily propose. Alternatives, like using cross-validation or an 
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empirical Bayes approach remain attractive, but unproven. 



Chapter 7 



Extensions 



There are numerous ways in which to extend vr, but perhaps the most pressing 
is to extend tt to mult i- dimensional data sets Dn = {{Xi, Bi)}^^^ where the 
predictor Xj is in R"^. One route to extend tt to higher dimensional problems is 
to observe that basically, all we need to consider is a suitable way to partition 
the space randomly. If an interesting way to choose a partition at random is 
found, then describe a new prior by saying: draw a partition and give each 
region an independent uniform success probability. One natural way to ran- 
domly partition R"^ is to suppose that a certain number of generating points are 
drawn from a Poisson process with constant rate function A and to associate 
each point with its Voronoi (nearest neighbor) region. Alternatively, one could 
select a subset of the observed x-values at random and use their locations to 
determine a Voronoi partition. This alternative is, unfortunately, not a purely 
Bayesian proposal since the partitioning depends on the data set given. On the 
other hand, it only depends on the x- values of the data set, so that it remains 
a Bayesian procedure with respect to the response data (the y- values). 

To be specific, the prior I consider (call it vr*) can be described by the fol- 
lowing. Let Xi, . . . ,Xn denote the n observed values of the covariates in X and 
let p be a metric on X. Proper choice of p is essential to good performance 
in complex applications, but using Euclidean distance should suffice for simple 
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problems. To any subset , . . . , Xi^ of the full list associate the Voronoi par- 
tition of X. That is, say that a point x is in the Xj. cell if p{x, Xj .) < p{x, Xj' ) 
for j' G {1, . . . ,k}. For definiteness, in the case of ties say that x is in the cell 
occurring first in the original ordering of the x's. Consequently, every point 
X G A" is in exactly one cell. Put a prior on these partitions of the space X 
by putting a prior on the (finite) set of all possible (nonempty) subsets of the 
list xi, . . . , Xn. Say that the prior probability of a subset only depends on the 
size of the subset and that the probability of a given size k, is proportional 
to the probability that a Geometric{l — a) random variable takes the value 
k. Finally, having chosen a subset at random, and consequently having fixed 
a partition of X , assign to each element of the partition a success probability 
Si drawn uniformly at random from [0, 1]. The generates a multi-dimensional 
regression function / : A* i-^ [0, 1] at random. 

To explore these ideas I simulated a two dimensional data set with 250 



data points, illustrated in Figure 7.0. a The x-values were chosen randomly 
(albeit not uniformly) from the illustrated rectangle; the ?/- values were drawn as 
independent Bernoulli random variables whose success probability is indicated 
by the gray-scale in the figure. Points in the whiter regions have a higher 
chance of being an "x," (i.e. y = 1) while points in the darker regions have 
a higher chance of being an "o" (i.e. y = 0). Jointly, the x and y data was 
actually generated in my simulation in the reverse manner: a fair coin was 
flipped to determine if y will be 1 or and then (conditionally) an x-value was 
chosen. Suppose y came up as 1, then with probability 1/2, x will be drawn 
uniformly from the square on the right; otherwise, with probability 1/2 it will 
be drawn from a bivariate normal distribution with standard deviation 0.1 that 
is centered on the left-hand square. If y came up a 0, the situation for x would 
be reversed. These two descriptions are essentially equivalent and the goal for 
the posterior mean estimate is always the same: to estimate the conditional 
probability of ?/ to be 1 given x; i.e. to estimate the gray-scale image. 

Sampling from the posterior of this prior is (theoretically at least) quite 
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Figure 7.0. a: A Two Dimensional Data Set and Target Function: The gray- 
scale on this figure depicts a certain regression function /q on a rectangle. 250 
data points are drawn by fiipping an /q (a;)-coin at the points indicated. Heads 
(red) tend to result when /q is large (white). Tails (blue) tend to result when 
/o is small (black) 
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simple. As before in [chapter 3[ the success probabihties can be integrated out 
analytically so that the posterior probability of a particular (nonempty) subset 
of size k is proportional to: 



Where A^^^^ and Nj denote the number of ?/- values equal to 1 or 0, respec- 
tively, on partition element j. Conditionally, the posterior distribution of a 
certain success probability Sj is Beta{Nj + l,Nj + 1). Since the number of 
possible subsets is finite and all of them (except the empty subset) have positive 
posterior probability, a standard Metropolis-Hastings type MCMC allows us 
to sample from the posterior (at least in theory). In practice, though, the rate 
of mixing matters; in an effort to improve this I have conducted preliminary 
work that employs the simulated tempering technique developed by Geyer and 
Thompson j39|. 

All that remains to be specified is a transitive random-walk on (nonempty) 
subsets of a set of n elements which has a known stationary distribution. This 
is easily done. Identify the class of all subsets of a set of size n with the class 
of binary vector of length n, with each coordinate indicating the presence or 
absence of a given element. Exclude the 0- vector from this set. Consider the 
random walk that picks a number J randomly from 1 to n and then proposes 
flipping the J'th bit. The proposal is not allowed if it would create the 0- vector; 
hold in this case. This Markov chain is easily seen to sample uniformly from 
the class of all non-empty subsets and consequently it is easy to modify it with 
the Metropolis-Hastings ratio in order to sample from the posterior. 

I have also proposed extensions of this technique to put a prior on smooth 
functions; under this proposal the prior concentrates on functions which softly 
partition the space using weighted Voronoi regions, but provide for smooth 
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Sample 3 Sample 6 



Figure T.O.b: Modal Samples: Voronoi PosteriorThe pictured samples were the 
ones most frequently occurring in a sample drawn from the Voronoi posterior 
TT*. Each sample is equivalent to a certain subset of the original covariate list. 
The included points determine the partition and are drawn with a green circle. 
The gray-scale on a given partition element represents the posterior mean of 
the corresponding success probability parameter. Notice how parsimoniously 
the circled points determine regions that isolate out the two clumps of data. 
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temperedct=13922, alpha=0.5, N=250 
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Figure 7.0. c: A Bivariate Example. 250 red and blue markers were put down 
randomly as shown. The posterior- mean estimate of /o under tt*, i.e. the 
estimated conditional probability of red at each position, is shown in gray; 
notice how the modal samples from Figure 7.0.b are incorporated into the 
posterior. 
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Figure T.O.d: Weighted Voronoi Posterior: Redefining the Voronoi cells to use 
random weights allows for elliptical arcs and lines to be used in the partition 
and eliminates some artifacts. 
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transitions between regions. 

A simple extension of this technique that removes many of the artifacts 
that are otherwise present because of the dependence on the particular loca- 
tions of the covariates can be made by using a randomly weighted Voronoi 
partitions js^. To make such a prior, simply augment each cell with a ran- 
dom variable Wj which is an independent r(7,7~^) a priori. Then, rede- 
fine the Voronoi partition so that cells with higher weight tend to be bigger. 
Specifically, consider the point x to be in cell j rather than cell k whenever: 
p{x,Xj)/wj < p{x,Xk) /wk- Intuitively, if an ordinary Voronoi partition can 
be understood by supposing that "crystals" grow out radially from the gener- 
ating seeds until they hit other growing crystals, then this weighted Voronoi 
partition allows for the crystals to grow at different rates. It is also simple 
to allow the crystals to start growing at different times (by simply adding a 
different random offset to p for each cell). The overall scale of the prior on W 
is irrelevant so this parameterization sets the mean to 1. For the experiment 



shown in Figure 7.0.d I used 7 = 5. As can be seen, this simple modification 
allows the posterior to choose neat balls and lines with which to isolate out 
the different contours of the data. The posterior is computed using standard 
MCMC techniques. 

Another route to extend tt is to utilize the observed connection between tt 
and ttdf (for details, please refer to lsubsection 2.2TTI . To make that connection, 
I employed binary random variables rii{u) that indicated if a test point u was 
or was not above a certain random threshold Vi. To generalize, then, we can 
take rii{x) to indicate if x is in a certain random half-space Hi. This would be 
similar to a version of CART in which we split first into two halves via Hi, 
and then split each half via H2, and so on. We could make something close to 
ordinary CART if we utilized random coordinate aligned half-spaces and if we 
employed a suitably "regularized" ttdf which did not assign an independent 
uniform to all possible 2^ binary /c-tuples at level k. More generally, one can 
invent other ways of "regularizing" vtdf so that numerous binary tuples are 
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Figure 7.0.e: Bagged CART in 2d: Running bagged CART on this two dimen- 
sional data yields a reasonably good estimate with an interesting horizontal 
and vertical blurring pattern 



tied together. 



Finally, for comparison, Figure 7.0.e the result of bagging ordinary two- 
dimensional CART on this same data is shown. It has a clear advantage on 
the vertical split down the center that happens to be coordinate aligned. It 
does a decent job of isolating the two clumps of data. Interestingly, there 
is a clear horizontal and vertical blurring pattern that arises from the use of 
partitions that only partially isolated the clumped data: it is isolated in one 
coordinate, but not in the other. 



Chapter 8 



Afterword 



As statisticians, we analyze data, formulate models, estimate parameters, and 
use these models to form predictions about future data. Broadly speaking, 
then, our business is inference. We go to a lot of trouble to formulate good 
models and we have spent a great deal of effort debating about the details 
of how to make inference within a model - e.g. Bayesian versus frequentist 
inference. 

Generally, though, the way we select our model remains an art. "Non- 
parametric" methods are a step forward here, because, generally speaking, they 
at least prescribe how to select a variety of "smoothing-parameters" which 
essentially determine which model among some class of models we actually 
apply. The main topic of this thesis is of this sort: A Bayesian approach 
to the question of how to estimate the number and location of change-points 
or, more generally, how to choose a partition of the data into approximately 
exchangeable subsets. 

In this afterword, I would like to step back from the details of this subject 
and address the more general problem of how we formulate a model. Some- 
times, we prefer to shunt responsibility for this and appeal to the scientist 
for help; but, fundamentally, all inference comes back to data eventually - 
how else did the scientists discover their model? Considering, then, that the 
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formulation of a model is essentially "an art," and that we know that the re- 
sults of our analyses are not absolute, but relative to the modeling choices we 
have made, how can we be so bold as to expect that reality will conform to 
our model-specific "confidence intervals?" This is not to say that statistical 
practice does not work, or that the formal properties of our analyses under 
our stated assumptions are invalid, but simply to remind the reader that the 
thread that connects the prescribed model with reality has no formal basis. 

In practice, a good rule of thumb is to consider several models and, con- 
sciously or not, select one which is simple and which we expect will fit the data 
at least nearly as well as other more complicated models that we might prefer 
not to have to consider. In practice, we look at the data ourselves before se- 
lecting a model and build in any particular types of regularity that we happen 
to notice that it possesses into our model - at least if we think it will affect 
our conclusions. 

Bayesian analysts do not escape these problems. They may subjectively 
allow for a mixture of several different models and for a range of "hyper- 
parameters" but this only ameliorates the core problem because no-one ever 
accounts for all the possible regularities that might be found. 

To further make my point, consider the following, admittedly fanciful, 
thought experiment. An alien race comes to earth and challenges mankind to 
an intelligence test. We are given a binary time-series to analyze one- hundred 
bits at a time. It begins innocuously enough: 

01000001000000100000110111101110000000100101000011 
10101010000000001110010010000000001011001101101010 

What statistical models shall we consider? We could try iid Bernoulli, or 
perhaps a hidden Markov model. If ambitious, we might let the data choose the 
order of our model. Upon doing so, we find that the null model fits best, despite 
the superficial appearance of runs, and we model the data as random coins with 
success probabihty 0.35. The ahens ask us to give a confidence interval on the 
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fraction of I's in the next 100 bits and to estimate the probabihty that the 
50'th bit of this new data will be 1. Most statisticians fall back on the classics 
and use something approximately like 0.35 ±0.1 for both answers. Others who 
used a richer model suggest wider intervals and the ordinary debate ensues. 

So far so good, right? Or should we worry that our glance at the data and 
default choice of classical model may miss structure that we failed to notice? 
Naah! The aliens couldn't be that tricky.... Then the next 100 bits arrive. 

00000000000000000000000000000000000000000000000000 
00000000000000000000000000000000000000000000000000 

A dramatic failure, but no problem, the advocates of the HMM model were 
ready for this sort of thing. We have, they argue, simply encountered a hidden 
state which always produces 0. This explains the data well enough, but some 
remain skeptical. They propose that the character of the data may change 
with every new segment of 100 bits so that the effective size of our data set is 
only 2. The next 100 bits arrive. 

00011100011100011100011100011100011100011100011100 
01110001110001110001110001110001110001110001110001 

And again we are surprised. Some HMM advocates insist we just need to 
add a number of new hidden states. Others extend the HMM model to favor 
this sort of cycle-like behavior. The next 100 bits arrive. 

00100100001111110110101010001000100001011010001100 
00100011010011000100110001100110001010001011100000 

And most are satisfied that the data has returned to iid, but with the 
world attention that this situation has generated someone notices that these 
are, in fact, the first 100 bits in the expansion for the fractional part of tt. (Cf. 
http://www.algonet.se/~eliasb/pi/binpi.html) Oh dear - we certainly 
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hadn't planned on this - but we ignore this regularity in the data at our peril. 
It couldn't happen by chance, could it? 

As the test continues, we continue to be surprised by the patterns we are 
sent. By now, we have learned a lot: we know less than we think about 
the future. Every 100 bits we have been presented with a pattern that we 
hadn't expected. Finally, we are given sequence after sequence that we can't 
explain. Eventually the aliens conclude that, however feebly, we are, at least, 
a modestly intelligent form of life; and, taking pity on us, they decide to 
reveal the patterns we hadn't discovered. The last sequences, for example, 
were actually Shakespeare, encoded by simple alien cipher that no human had 
ever considered. Furthermore, the first sequence wasn't actually "random:" 
to generate it, all we had to do was start matlab or an equivalent (alien) 
computational program and type: 

x=rand(l,100)<=0.397; 
sprintf Cy.c' ,x+'0') 

Perhaps you object to my example. You prefer regression to time-series 
analysis and are content to consider data for which no-one would object to 
the model that the responses are independent given the predictors. Perhaps 
the problem of extrapolating a non- stationary time-series seems far too lofty 
to you. But you haven't escaped it by wishing it away; in fact, the time-series 
problem can be embedded in the regression problem. We need only suppose 
that the covariates are tested one by one in some fixed designed fashion so 
that we see the responses sequentially. If we know that the regression function 
"smoothly" depends on the covariates, present methods can be expected to 
work; but, if the dependence is sufficiently complicated, each new data point 
tells us something entirely new, just like in my fictitious time-series. 

Even if the data is generated iid - the regression case considered in most 
of my theoretical work - there is plenty of room for improvement. In this 
situation, we are, indeed, much better off - we can make rigorous probability 
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statements about the quality of our predictions on average. Even so, as more 
and more data come in, and the general shape of the regression function be- 
comes more tightly resolved, the "knowledge we gain" itself comes to us in a 
time-series fashion. Because of this, we cannot easily make predictions about 
the regression function at some fixed point. 

For example, suppose the unit interval were divided into subintervals of size 
J, |, etc... and suppose that the regression function takes a different value on 
each piece. In this way, we would quickly have enough data to estimate large- 
scale features, like the value of the regression function on the larger pieces, but 
if we are asked to make a prediction on one of the very small pieces, what are we 
to do? Perhaps, if we were smart enough, we wouldn't blithely approximate the 
regression function as "smooth," we would look for patterns in the regression 
values on the large pieces to help extrapolate to the smaller pieces. If, in fact, 
the regression values were seen to alternate between high and low, we would 
be silly to treat the function as "smooth." Instead, let us hope that were are 
lucky enough that it is quite "regular." 

What, then are we to do? Consider again the alien's test, and the complex 
sequence of modeling decisions that we needed to make along the way. How 
can we summarize the thought process that we went through as surprising data 
continued to come in? What possible prior on models could our analysis (even 
approximately) conform to? Our only recourse, it seems, is to formalize the 
idea of regularity and make explicit the manner in which we choose a model 
to accord with the regularities apparent in the data. 

A reasonable defining property of regularity is that a distribution is regu- 
lar if it can be (approximately) reproduced within a certain budget of time 
by applying some modestly short computer program to the data that we have 
previously seen and a "random sequence." Roughly speaking, then, we can 
put a prior on models and/or regularities by putting a suitable prior on com- 
puter programs. Alternatively, we can select among the computer programs 
in a manner conforming more closely to the "method of maximum likelihood." 
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Formal versions of these ideas have been proposed 
though much work is needed to formulate methods that are ready for actual 
use. Still, it seems to this author that the further development of some version 
of these ideas is an essential, natural, and unavoidable step in the progression 
of statistical thinking. 
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